Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

Lucas Iijima; Nikolaos Giakoumoglou; Tania Stathaki

arxiv: 2403.15152 · v3 · submitted 2024-03-22 · 💻 cs.CV

Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

Lucas Iijima , Nikolaos Giakoumoglou , Tania Stathaki This is my paper

Pith reviewed 2026-05-24 03:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords cross-domain image retrievalcaption matchingvision-language modelszero-shot retrievalunsupervised domain adaptationmultimodal similarityplug-and-play retrieval

0 comments

The pith

Generated image captions act as a domain-agnostic bridge for retrieving matches across sketches, paintings, and photographs without labels or training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Caption-Matching as a way to turn images from mismatched visual domains into text descriptions using existing pre-trained models. These descriptions then serve as the basis for computing similarity between images that otherwise look unrelated due to style differences. The approach requires no labeled cross-domain pairs and no additional training on target data, which removes common barriers in cross-domain retrieval. Results on Office-Home and DomainNet show consistent gains over earlier methods, and the same pipeline handles queries that mix real and AI-generated images. A reader would care because it offers a lightweight alternative to supervised domain adaptation for tasks where collecting matched examples is impractical.

Core claim

Caption-Matching generates captions for each image with pre-trained vision-language models and treats those captions as an intermediate representation that remains stable across visual domains. Similarity between images from different domains is then computed directly on the captions, which enables retrieval without any requirement for labeled cross-domain correspondences or model fine-tuning. The method records state-of-the-art performance in plug-and-play settings on Office-Home and DomainNet while also succeeding on queries that combine real photographs with AI-generated images from Midjourney.

What carries the argument

Caption-Matching (CM), which converts each image into a caption via a pre-trained vision-language model and uses caption similarity for cross-domain matching.

If this is right

Cross-domain retrieval works in a fully unsupervised plug-and-play manner on standard benchmarks.
Performance improves on Office-Home and DomainNet without domain-specific training or labeled pairs.
The same caption-based matching extends to queries that mix multiple visual domains including AI-generated images.
No requirement for target-domain labels lowers the data cost of deploying retrieval systems across new styles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same caption intermediary could be tested on video frames or 3D renders to check whether language still bridges those modality gaps.
Replacing the caption generator with a stronger model would be a direct way to measure how much retrieval quality depends on caption fidelity.
Domains that produce highly ambiguous captions, such as abstract art, would expose where the method's performance ceiling lies.
Caption matching could simplify retrieval pipelines that already use language as a common space for other modalities.

Load-bearing premise

Captions produced by current pre-trained vision-language models retain enough semantic detail to serve as a reliable domain-invariant signal for similarity computation.

What would settle it

A drop in retrieval accuracy on domain pairs where the same visual content receives markedly different captions from the model, or where visually dissimilar content receives similar captions.

Figures

Figures reproduced from arXiv: 2403.15152 by Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki.

**Figure 2.** Figure 2: t-SNE visualization of text embeddings generated [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of the top-10 retrieved [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Top-5 retrieval results on Midjourney’s database. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on methods that require training or fine-tuning on target datasets, often struggling with substantial domain gaps and limited generalization to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The caption-matching trick is a clean no-train baseline for CDIR but the results rest on whether photo-trained VLMs actually produce usable captions for sketches and art.

read the letter

The one thing to know is that this paper turns cross-domain retrieval into a caption generation plus text matching problem, claiming it works without any training or labeled data. The new part is using off-the-shelf vision-language models to create a domain-agnostic text representation for matching images from photos, sketches, and paintings. It does well by showing results on standard benchmarks like Office-Home and DomainNet plus an extra test on AI-generated images, and by keeping the approach training-free. The soft spots are around the caption step itself. The method assumes that captions from models trained mostly on photos will capture the right semantics for sketches and art without losing key details. That assumption is not obviously true, and if the captions become generic or wrong on those domains, the text similarity can't fix it. The abstract does not include any numbers, ablations, or checks on caption accuracy per domain, which makes the SOTA claim difficult to evaluate from the summary alone. If the full paper has those controls, it would help. This paper is for people in computer vision who need a quick, label-free way to handle domain shifts in retrieval tasks. A reader who already works with VLMs would find it easy to implement and test. I would recommend sending it for peer review because the core pipeline is clear and the datasets are standard, even though the caption robustness needs more scrutiny in revision.

Referee Report

3 major / 2 minor

Summary. The paper proposes Caption-Matching (CM), a training-free cross-domain image retrieval method that generates captions from pre-trained vision-language models (e.g., BLIP, LLaVA) and performs similarity search in caption embedding space. It claims this yields state-of-the-art results on Office-Home and DomainNet without labeled data or fine-tuning, plus results on Midjourney AI-generated images.

Significance. If the central claim holds, the work would demonstrate a simple plug-and-play zero-shot CDIR approach that sidesteps supervised domain adaptation. The significance is tempered by the absence of reported quantitative metrics, ablations, or caption-fidelity analysis in the abstract and by the method's dependence on external pre-trained VLMs whose behavior on non-photographic domains is uncharacterized.

major comments (3)

[Abstract, §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on Office-Home and DomainNet yet supplies no numerical results (e.g., mAP, recall@K), baseline comparisons, or ablation tables; without these data the central empirical claim cannot be evaluated.
[§3] §3 (Method): The domain-agnostic claim rests on the untested assumption that captions from photo-centric VLMs preserve fine-grained semantics on sketches, paintings, and clip-art; no per-domain caption accuracy, hallucination rate, or fidelity metrics are provided to support this.
[§4] §4: No error analysis or controlled experiment isolates whether retrieval gains survive when caption quality is degraded (e.g., by substituting generic or incorrect captions), leaving the weakest assumption unexamined.

minor comments (2)

[§3.1] Notation for the text encoder and similarity function should be introduced with explicit equations rather than prose descriptions.
[§4.3] The Midjourney dataset description lacks details on query construction and domain composition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on Office-Home and DomainNet yet supplies no numerical results (e.g., mAP, recall@K), baseline comparisons, or ablation tables; without these data the central empirical claim cannot be evaluated.

Authors: We agree that the abstract would be strengthened by including key numerical results. In the revised manuscript we will update the abstract to report the main mAP and Recall@K figures on Office-Home and DomainNet together with the primary baseline comparisons. The full tables already appear in §4; the change is limited to surfacing the headline numbers in the abstract. revision: yes
Referee: [§3] §3 (Method): The domain-agnostic claim rests on the untested assumption that captions from photo-centric VLMs preserve fine-grained semantics on sketches, paintings, and clip-art; no per-domain caption accuracy, hallucination rate, or fidelity metrics are provided to support this.

Authors: The manuscript presents the domain-agnostic claim through consistent retrieval gains across all domains in the benchmarks. While we do not supply explicit per-domain caption fidelity or hallucination statistics, the end-to-end results serve as indirect validation. To address the request directly we will add, in the revision, a short qualitative caption analysis across domains and any available quantitative fidelity numbers on annotated subsets. revision: partial
Referee: [§4] §4: No error analysis or controlled experiment isolates whether retrieval gains survive when caption quality is degraded (e.g., by substituting generic or incorrect captions), leaving the weakest assumption unexamined.

Authors: We concur that an explicit ablation on caption quality would be informative. In the revised §4 we will insert a controlled experiment that replaces the generated captions with generic or random text and reports the resulting drop in retrieval metrics, thereby isolating the contribution of caption semantics. revision: yes

Circularity Check

0 steps flagged

No circularity; retrieval performance depends on external pre-trained VLMs

full rationale

The paper introduces Caption-Matching by generating captions from off-the-shelf vision-language models (BLIP, LLaVA, etc.) and then computing text similarity. No equations, fitted parameters, or self-citations appear in the provided text. The central claim does not reduce any quantity to a definition or fit internal to the work; performance is explicitly attributed to the external models' semantic preservation, which is an assumption but not a circular derivation. This matches the default expectation of a self-contained method whose validity rests on external components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain-agnostic property of captions and the semantic fidelity of off-the-shelf VLMs; no free parameters, ad-hoc axioms, or new entities are introduced in the abstract.

axioms (1)

domain assumption Pre-trained vision-language models produce captions that capture semantic content independent of visual domain style.
Invoked when the paper states that captions serve as a domain-agnostic intermediate representation.

pith-pipeline@v0.9.0 · 5708 in / 1182 out tokens · 43763 ms · 2026-05-24T03:06:07.526956+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 3 internal anchors

[1]

Image retrieval: Ideas, influences, and trends of the new age,

R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Computing Surveys (Csur), vol. 40, no. 2, pp. 1–60, 2008

work page 2008
[2]

Vqa: Visual question answering,

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zit- nick, and D. Parikh, “Vqa: Visual question answering,” in 2015 IEEE International Conference on Computer Vision (ICCV) , 2015, pp. 2425–2433

work page 2015
[3]

Learning fine-grained image similarity with deep ranking,

J. Wang, Y . Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y . Wu, “Learning fine-grained image similarity with deep ranking,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1386–1393

work page 2014
[4]

Adversarial cross-modal retrieval,

B. Wang, Y . Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proceedings of the 25th ACM International Conference on Multimedia , 2017, pp. 154–162. 7

work page 2017
[5]

Domain Adaptation for Visual Applications: A Comprehensive Survey

G. Csurka, “Domain adaptation for visual applications: A compre- hensive survey,” arXiv preprint arXiv:1702.05374 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[6]

The sketchy database: learning to retrieve badly drawn bunnies,

P. Sangkloy, N. Burnell, C. Ham, and J. Hays, “The sketchy database: learning to retrieve badly drawn bunnies,” in ACM SIGGRAPH 2016 Papers, 2016, pp. 1–12

work page 2016
[7]

Sketch me that shoe,

Q. Yu, F. Liu, Y .-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy, “Sketch me that shoe,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016, pp. 799–807

work page 2016
[8]

Deep spatial-semantic attention for fine-grained sketch-based image retrieval,

J. Song, Q. Yu, Y .-Z. Song, T. Xiang, and T. M. Hospedales, “Deep spatial-semantic attention for fine-grained sketch-based image retrieval,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5551–5560

work page 2017
[9]

Deep learning for generic object detection: A sur- vey,

L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietik ¨ainen, “Deep learning for generic object detection: A sur- vey,” International Journal of Computer Vision , vol. 128, pp. 261– 318, 2020

work page 2020
[10]

Adversarial discriminative domain adaptation,

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017, pp. 2962– 2971

work page 2017
[11]

Domain-adversarial training of neural networks,

Y . Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Lavi- olette, M. Marchand, and V . Lempitsky, “Domain-adversarial training of neural networks,” in Domain Adaptation in Computer Vision Applications. Springer, 2017, pp. 189–209

work page 2017
[12]

Moment matching for multi-source domain adaptation,

X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) , 2019, pp. 1406–1415

work page 2019
[13]

Learning transferable visual models from natural lan- guage supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural lan- guage supervision,” in International Conference on Machine Learn- ing. PMLR, 2021, pp. 8748–8763

work page 2021
[14]

Scaling up visual and vision-language representation learning with noisy text supervision,

C. Jia et al. , “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916

work page 2021
[15]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Ob- ject retrieval with large vocabularies and fast spatial matching,

J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Ob- ject retrieval with large vocabularies and fast spatial matching,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8

work page 2007
[17]

Hamming embedding and weak geometric consistency for large scale image search,

H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in European Conference on Computer Vision (ECCV) . Springer, 2008, pp. 304– 317

work page 2008
[18]

Large-scale image retrieval with attentive deep local features,

H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in 2017 IEEE International Conference on Computer Vision (ICCV) , 2017, pp. 3456–3465

work page 2017
[19]

Fine-tuning cnn image retrieval with no human annotation,

F. Radenovi ´c, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 41, no. 7, pp. 1655–1668, 2019

work page 2019
[20]

Efficient image retrieval via decoupling diffusion into online and offline processing,

F. Yang, R. Hinami, Y . Matsui, S. Ly, and S. Satoh, “Efficient image retrieval via decoupling diffusion into online and offline processing,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, 2019

work page 2019
[21]

Solar: Second-order loss and attention for image retrieval,

T. Ng, V . Balntas, Y . Tian, and K. Mikolajczyk, “Solar: Second-order loss and attention for image retrieval,” in European Conference on Computer Vision (ECCV) . Springer, 2020, pp. 253–270

work page 2020
[22]

Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features,

M. Yang, D. He, M. Fan, B. Shi, X. Xue, F. Li, E. Ding, and J. Huang, “Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , 2021, pp. 11 772–11 781

work page 2021
[23]

Revisiting self-similarity: Structural embedding for image retrieval,

S. Lee, S. Lee, H. Seong, and E. Kim, “Revisiting self-similarity: Structural embedding for image retrieval,” in 2023 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR) , 2023, pp. 23 412–23 421

work page 2023
[24]

Cross-domain image retrieval with a dual attribute-aware ranking network,

J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” in 2015 IEEE International Conference on Computer Vision (ICCV) , 2015, pp. 1062–1070

work page 2015
[25]

Cds: Cross-domain self-supervised pre-training,

D. Kim, K. Saito, T.-H. Oh, B. A. Plummer, S. Sclaroff, and K. Saenko, “Cds: Cross-domain self-supervised pre-training,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , 2021, pp. 9123–9132

work page 2021
[26]

Learning to prompt for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision , vol. 130, pp. 2337–2348, 2022

work page 2022
[27]

Clip models are few-shot learners: Empirical studies on vqa and visual entailment,

H. Song, L. Dong, W. Zhang, T. Liu, and F. Wei, “Clip models are few-shot learners: Empirical studies on vqa and visual entailment,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022, pp. 6088–6100

work page 2022
[28]

Flamingo: a visual language model for few- shot learning,

J.-B. Alayrac et al. , “Flamingo: a visual language model for few- shot learning,” Advances in Neural Information Processing Systems , vol. 35, pp. 23 716–23 736, 2022

work page 2022
[29]

Blip: Bootstrapping language- image pre-training for unified vision-language understanding and gen- eration,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language- image pre-training for unified vision-language understanding and gen- eration,” in International Conference on Machine Learning . PMLR, 2022, pp. 12 888–12 900

work page 2022
[30]

Flava: A foundational language and vision alignment model,

A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “Flava: A foundational language and vision alignment model,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 638–15 650

work page 2022
[31]

Fine-grained image retrieval: the text/sketch input dilemma,

J. Song, Y .-Z. Song, T. Xiang, and T. Hospedales, “Fine-grained image retrieval: the text/sketch input dilemma,” in Proceedings of the British Machine Vision Conference (BMVC) , 2017, pp. 45.1–45.12

work page 2017
[32]

A sketch is worth a thousand words: Image retrieval with text and sketch,

P. Sangkloy, W. Jitkrittum, D. Yang, and J. Hays, “A sketch is worth a thousand words: Image retrieval with text and sketch,” in European Conference on Computer Vision (ECCV) , 2022, pp. 251–267

work page 2022
[33]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021

work page 2021
[34]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,”

work page
[35]

Available: https://api.semanticscholar.org/CorpusID: 160025533

[Online]. Available: https://api.semanticscholar.org/CorpusID: 160025533

work page
[36]

Visualizing data using t-sne,

L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research , vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/van dermaaten08a.html

work page 2008
[37]

Eva: Exploring the limits of masked visual representation learning at scale,

Y . Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y . Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

work page 2023
[38]

OPT: Open Pre-trained Transformer Language Models

S. Zhang et al. , “Opt: Open pre-trained transformer language models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.01068

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations , 2019

work page 2019
[40]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Doll´ar, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014. 8

work page 2014
[41]

Visual genome: Connecting language and vision using crowdsourced dense image annotations,

R. Krishna et al. , “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , p. 32–73, 2017

work page 2017
[42]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,

P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the Association for Computa- tional Linguistics, 2018, pp. 2556–2565

work page 2018
[43]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long- tail visual concepts,

S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long- tail visual concepts,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2021, pp. 3557–3567

work page 2021
[44]

Im2text: Describing images using 1 million captioned photographs,

V . Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” in Advances in Neural In- formation Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., vol. 24. Curran Associates, Inc., 2011

work page 2011
[45]

Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,

C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki, “Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,” 2021

work page 2021
[46]

Laion-5b: An open large-scale dataset for training next generation image-text models,

C. Schuhmann et al. , “Laion-5b: An open large-scale dataset for training next generation image-text models,” in Advances in Neural Information Processing Systems , 2022, pp. 25 278–25 294

work page 2022
[47]

Deep hashing network for unsupervised domain adaptation,

H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5018–5027

work page 2017
[48]

Prototypical contrastive learning of unsupervised representations,

J. Li, P. Zhou, C. Xiong, and S. Hoi, “Prototypical contrastive learning of unsupervised representations,” inInternational Conference on Learning Representations (ICLR) , 2021

work page 2021
[49]

Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation,

X. Yue et al., “Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2021

work page 2021
[50]

Feature representation learning for unsu- pervised cross-domain image retrieval,

C. Hu and G. H. Lee, “Feature representation learning for unsu- pervised cross-domain image retrieval,” in European Conference on Computer Vision (ECCV) , 2022

work page 2022
[51]

Correspondence-free domain alignment for unsupervised cross-domain image retrieval,

X. Wang, D. Peng, M. Yan, and P. Hu, “Correspondence-free domain alignment for unsupervised cross-domain image retrieval,” Proceed- ings of the AAAI Conference on Artificial Intelligence , vol. 37, no. 8, pp. 10 200–10 208, 2023. 9

work page 2023

[1] [1]

Image retrieval: Ideas, influences, and trends of the new age,

R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Computing Surveys (Csur), vol. 40, no. 2, pp. 1–60, 2008

work page 2008

[2] [2]

Vqa: Visual question answering,

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zit- nick, and D. Parikh, “Vqa: Visual question answering,” in 2015 IEEE International Conference on Computer Vision (ICCV) , 2015, pp. 2425–2433

work page 2015

[3] [3]

Learning fine-grained image similarity with deep ranking,

J. Wang, Y . Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y . Wu, “Learning fine-grained image similarity with deep ranking,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1386–1393

work page 2014

[4] [4]

Adversarial cross-modal retrieval,

B. Wang, Y . Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proceedings of the 25th ACM International Conference on Multimedia , 2017, pp. 154–162. 7

work page 2017

[5] [5]

Domain Adaptation for Visual Applications: A Comprehensive Survey

G. Csurka, “Domain adaptation for visual applications: A compre- hensive survey,” arXiv preprint arXiv:1702.05374 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

The sketchy database: learning to retrieve badly drawn bunnies,

P. Sangkloy, N. Burnell, C. Ham, and J. Hays, “The sketchy database: learning to retrieve badly drawn bunnies,” in ACM SIGGRAPH 2016 Papers, 2016, pp. 1–12

work page 2016

[7] [7]

Sketch me that shoe,

Q. Yu, F. Liu, Y .-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy, “Sketch me that shoe,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016, pp. 799–807

work page 2016

[8] [8]

Deep spatial-semantic attention for fine-grained sketch-based image retrieval,

J. Song, Q. Yu, Y .-Z. Song, T. Xiang, and T. M. Hospedales, “Deep spatial-semantic attention for fine-grained sketch-based image retrieval,” in2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5551–5560

work page 2017

[9] [9]

Deep learning for generic object detection: A sur- vey,

L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietik ¨ainen, “Deep learning for generic object detection: A sur- vey,” International Journal of Computer Vision , vol. 128, pp. 261– 318, 2020

work page 2020

[10] [10]

Adversarial discriminative domain adaptation,

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017, pp. 2962– 2971

work page 2017

[11] [11]

Domain-adversarial training of neural networks,

Y . Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Lavi- olette, M. Marchand, and V . Lempitsky, “Domain-adversarial training of neural networks,” in Domain Adaptation in Computer Vision Applications. Springer, 2017, pp. 189–209

work page 2017

[12] [12]

Moment matching for multi-source domain adaptation,

X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) , 2019, pp. 1406–1415

work page 2019

[13] [13]

Learning transferable visual models from natural lan- guage supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural lan- guage supervision,” in International Conference on Machine Learn- ing. PMLR, 2021, pp. 8748–8763

work page 2021

[14] [14]

Scaling up visual and vision-language representation learning with noisy text supervision,

C. Jia et al. , “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916

work page 2021

[15] [15]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Ob- ject retrieval with large vocabularies and fast spatial matching,

J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Ob- ject retrieval with large vocabularies and fast spatial matching,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8

work page 2007

[17] [17]

Hamming embedding and weak geometric consistency for large scale image search,

H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in European Conference on Computer Vision (ECCV) . Springer, 2008, pp. 304– 317

work page 2008

[18] [18]

Large-scale image retrieval with attentive deep local features,

H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in 2017 IEEE International Conference on Computer Vision (ICCV) , 2017, pp. 3456–3465

work page 2017

[19] [19]

Fine-tuning cnn image retrieval with no human annotation,

F. Radenovi ´c, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 41, no. 7, pp. 1655–1668, 2019

work page 2019

[20] [20]

Efficient image retrieval via decoupling diffusion into online and offline processing,

F. Yang, R. Hinami, Y . Matsui, S. Ly, and S. Satoh, “Efficient image retrieval via decoupling diffusion into online and offline processing,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, 2019

work page 2019

[21] [21]

Solar: Second-order loss and attention for image retrieval,

T. Ng, V . Balntas, Y . Tian, and K. Mikolajczyk, “Solar: Second-order loss and attention for image retrieval,” in European Conference on Computer Vision (ECCV) . Springer, 2020, pp. 253–270

work page 2020

[22] [22]

Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features,

M. Yang, D. He, M. Fan, B. Shi, X. Xue, F. Li, E. Ding, and J. Huang, “Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , 2021, pp. 11 772–11 781

work page 2021

[23] [23]

Revisiting self-similarity: Structural embedding for image retrieval,

S. Lee, S. Lee, H. Seong, and E. Kim, “Revisiting self-similarity: Structural embedding for image retrieval,” in 2023 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR) , 2023, pp. 23 412–23 421

work page 2023

[24] [24]

Cross-domain image retrieval with a dual attribute-aware ranking network,

J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” in 2015 IEEE International Conference on Computer Vision (ICCV) , 2015, pp. 1062–1070

work page 2015

[25] [25]

Cds: Cross-domain self-supervised pre-training,

D. Kim, K. Saito, T.-H. Oh, B. A. Plummer, S. Sclaroff, and K. Saenko, “Cds: Cross-domain self-supervised pre-training,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , 2021, pp. 9123–9132

work page 2021

[26] [26]

Learning to prompt for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision , vol. 130, pp. 2337–2348, 2022

work page 2022

[27] [27]

Clip models are few-shot learners: Empirical studies on vqa and visual entailment,

H. Song, L. Dong, W. Zhang, T. Liu, and F. Wei, “Clip models are few-shot learners: Empirical studies on vqa and visual entailment,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022, pp. 6088–6100

work page 2022

[28] [28]

Flamingo: a visual language model for few- shot learning,

J.-B. Alayrac et al. , “Flamingo: a visual language model for few- shot learning,” Advances in Neural Information Processing Systems , vol. 35, pp. 23 716–23 736, 2022

work page 2022

[29] [29]

Blip: Bootstrapping language- image pre-training for unified vision-language understanding and gen- eration,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language- image pre-training for unified vision-language understanding and gen- eration,” in International Conference on Machine Learning . PMLR, 2022, pp. 12 888–12 900

work page 2022

[30] [30]

Flava: A foundational language and vision alignment model,

A. Singh, R. Hu, V . Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, “Flava: A foundational language and vision alignment model,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 638–15 650

work page 2022

[31] [31]

Fine-grained image retrieval: the text/sketch input dilemma,

J. Song, Y .-Z. Song, T. Xiang, and T. Hospedales, “Fine-grained image retrieval: the text/sketch input dilemma,” in Proceedings of the British Machine Vision Conference (BMVC) , 2017, pp. 45.1–45.12

work page 2017

[32] [32]

A sketch is worth a thousand words: Image retrieval with text and sketch,

P. Sangkloy, W. Jitkrittum, D. Yang, and J. Hays, “A sketch is worth a thousand words: Image retrieval with text and sketch,” in European Conference on Computer Vision (ECCV) , 2022, pp. 251–267

work page 2022

[33] [33]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021

work page 2021

[34] [34]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,”

work page

[35] [35]

Available: https://api.semanticscholar.org/CorpusID: 160025533

[Online]. Available: https://api.semanticscholar.org/CorpusID: 160025533

work page

[36] [36]

Visualizing data using t-sne,

L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research , vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/van dermaaten08a.html

work page 2008

[37] [37]

Eva: Exploring the limits of masked visual representation learning at scale,

Y . Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y . Cao, “Eva: Exploring the limits of masked visual representation learning at scale,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

work page 2023

[38] [38]

OPT: Open Pre-trained Transformer Language Models

S. Zhang et al. , “Opt: Open pre-trained transformer language models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.01068

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations , 2019

work page 2019

[40] [40]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Doll´ar, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014. 8

work page 2014

[41] [41]

Visual genome: Connecting language and vision using crowdsourced dense image annotations,

R. Krishna et al. , “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , p. 32–73, 2017

work page 2017

[42] [42]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,

P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the Association for Computa- tional Linguistics, 2018, pp. 2556–2565

work page 2018

[43] [43]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long- tail visual concepts,

S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long- tail visual concepts,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2021, pp. 3557–3567

work page 2021

[44] [44]

Im2text: Describing images using 1 million captioned photographs,

V . Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” in Advances in Neural In- formation Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., vol. 24. Curran Associates, Inc., 2011

work page 2011

[45] [45]

Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,

C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki, “Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,” 2021

work page 2021

[46] [46]

Laion-5b: An open large-scale dataset for training next generation image-text models,

C. Schuhmann et al. , “Laion-5b: An open large-scale dataset for training next generation image-text models,” in Advances in Neural Information Processing Systems , 2022, pp. 25 278–25 294

work page 2022

[47] [47]

Deep hashing network for unsupervised domain adaptation,

H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5018–5027

work page 2017

[48] [48]

Prototypical contrastive learning of unsupervised representations,

J. Li, P. Zhou, C. Xiong, and S. Hoi, “Prototypical contrastive learning of unsupervised representations,” inInternational Conference on Learning Representations (ICLR) , 2021

work page 2021

[49] [49]

Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation,

X. Yue et al., “Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation,” in 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2021

work page 2021

[50] [50]

Feature representation learning for unsu- pervised cross-domain image retrieval,

C. Hu and G. H. Lee, “Feature representation learning for unsu- pervised cross-domain image retrieval,” in European Conference on Computer Vision (ECCV) , 2022

work page 2022

[51] [51]

Correspondence-free domain alignment for unsupervised cross-domain image retrieval,

X. Wang, D. Peng, M. Yan, and P. Hu, “Correspondence-free domain alignment for unsupervised cross-domain image retrieval,” Proceed- ings of the AAAI Conference on Artificial Intelligence , vol. 37, no. 8, pp. 10 200–10 208, 2023. 9

work page 2023