
arxiv: 2403.15152 · v3 · submitted 2024-03-22 · 💻 cs.CV

Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

Pith reviewed 2026-05-24 03:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-domain image retrieval · caption matching · vision-language models · zero-shot retrieval · unsupervised domain adaptation · multimodal similarity · plug-and-play retrieval

The pith

Generated image captions act as a domain-agnostic bridge for retrieving matches across sketches, paintings, and photographs without labels or training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Caption-Matching as a way to turn images from mismatched visual domains into text descriptions using existing pre-trained models. These descriptions then serve as the basis for computing similarity between images that otherwise look unrelated due to style differences. The approach requires no labeled cross-domain pairs and no additional training on target data, which removes common barriers in cross-domain retrieval. Results on Office-Home and DomainNet show consistent gains over earlier methods, and the same pipeline handles queries that mix real and AI-generated images. A reader would care because it offers a lightweight alternative to supervised domain adaptation for tasks where collecting matched examples is impractical.

Core claim

Caption-Matching generates captions for each image with pre-trained vision-language models and treats those captions as an intermediate representation that remains stable across visual domains. Similarity between images from different domains is then computed directly on the captions, which enables retrieval without any requirement for labeled cross-domain correspondences or model fine-tuning. The method records state-of-the-art performance in plug-and-play settings on Office-Home and DomainNet while also succeeding on queries that combine real photographs with AI-generated images from Midjourney.

What carries the argument

Caption-Matching (CM), which converts each image into a caption via a pre-trained vision-language model and uses caption similarity for cross-domain matching.
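The matching step can be illustrated with a minimal sketch. It assumes captions have already been generated, and it swaps in a bag-of-words encoder with cosine similarity as a stand-in for whatever text encoder the paper uses. The function names, the stopword list, and the example captions are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword list so function words do not dominate similarity.
STOPWORDS = {"a", "an", "the", "of", "on", "in"}


def embed(caption):
    """Stand-in text encoder: bag-of-words counts (not the paper's encoder)."""
    words = re.findall(r"[a-z0-9]+", caption.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_caption, gallery, k=3):
    """Rank gallery image ids by caption similarity to the query caption."""
    q = embed(query_caption)
    ranked = sorted(gallery, key=lambda image_id: cosine(q, embed(gallery[image_id])), reverse=True)
    return ranked[:k]


# Hypothetical captions, as a captioning model might produce them.
gallery_captions = {
    "photo_bike.jpg": "a red bicycle leaning against a wall",
    "photo_mug.jpg": "a white coffee mug on a wooden table",
    "painting_chair.jpg": "a painting of a wooden chair in a room",
}
top = retrieve("a sketch of a bicycle", gallery_captions, k=1)
print(top)
```

In a real pipeline, `embed` would be replaced by a learned text encoder and the captions would come from the captioning model; the ranking logic stays the same.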

If this is right

  • Cross-domain retrieval works in a fully unsupervised plug-and-play manner on standard benchmarks.
  • Performance improves on Office-Home and DomainNet without domain-specific training or labeled pairs.
  • The same caption-based matching extends to queries that mix multiple visual domains including AI-generated images.
  • No requirement for target-domain labels lowers the data cost of deploying retrieval systems across new styles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same caption intermediary could be tested on video frames or 3D renders to check whether language still bridges those modality gaps.
  • Replacing the caption generator with a stronger model would be a direct way to measure how much retrieval quality depends on caption fidelity.
  • Domains that produce highly ambiguous captions, such as abstract art, would expose where the method's performance ceiling lies.
  • Caption matching could simplify retrieval pipelines that already use language as a common space for other modalities.

Load-bearing premise

Captions produced by current pre-trained vision-language models retain enough semantic detail to serve as a reliable domain-invariant signal for similarity computation.

What would settle it

A drop in retrieval accuracy on domain pairs where the same visual content receives markedly different captions from the model, or where visually dissimilar content receives similar captions.
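One way to look for such domain pairs is to compare the captions that the same content receives in two domains and flag the pairs that diverge. The sketch below is only an illustration: Jaccard overlap of content words, the style-word list, the threshold, and the example captions are all assumptions, not part of the paper.

```python
import re

# Assumptions: words that describe the medium rather than the content are ignored.
STYLE_WORDS = {"sketch", "drawing", "painting", "photo", "photograph", "clipart"}
STOPWORDS = {"a", "an", "the", "of", "on", "in"}


def content_tokens(caption):
    """Set of caption words after dropping stopwords and style words."""
    tokens = set(re.findall(r"[a-z0-9]+", caption.lower()))
    return tokens - STOPWORDS - STYLE_WORDS


def jaccard(a, b):
    """Jaccard overlap of two token sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flag_divergent(pairs, threshold=0.3):
    """Return ids of same-content pairs whose captions overlap less than threshold."""
    flagged = []
    for pair_id, (caption_a, caption_b) in pairs.items():
        if jaccard(content_tokens(caption_a), content_tokens(caption_b)) < threshold:
            flagged.append(pair_id)
    return flagged


# Hypothetical captions for the same object rendered in two domains.
pairs = {
    "bike": ("a photo of a red bicycle", "a sketch of a bicycle"),
    "alarm_clock": ("a photo of an alarm clock", "a drawing of a round face"),
}
divergent = flag_divergent(pairs)
print(divergent)
```

Retrieval accuracy could then be reported separately for flagged and unflagged pairs to see whether caption divergence tracks the failures.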

Figures

Figures reproduced from arXiv: 2403.15152 by Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki.

Figure 1: Overview of the proposed Caption-Matching (CM) method. The process begins with the BLIP-2 model …
Figure 2: t-SNE visualization of text embeddings generated …
Figure 3: Qualitative comparison of the top-10 retrieved …
Figure 4: Top-5 retrieval results on Midjourney’s database.
Original abstract

Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on methods that require training or fine-tuning on target datasets, often struggling with substantial domain gaps and limited generalization to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Caption-Matching (CM), a training-free cross-domain image retrieval method that generates captions with pre-trained vision-language models (BLIP-2, per the Figure 1 caption) and performs similarity search in caption embedding space. It claims this yields state-of-the-art results on Office-Home and DomainNet without labeled data or fine-tuning, and it also reports results on Midjourney AI-generated images.

Significance. If the central claim holds, the work would demonstrate a simple plug-and-play zero-shot CDIR approach that sidesteps supervised domain adaptation. The significance is tempered by the absence of reported quantitative metrics, ablations, or caption-fidelity analysis in the abstract and by the method's dependence on external pre-trained VLMs whose behavior on non-photographic domains is uncharacterized.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on Office-Home and DomainNet yet supplies no numerical results (e.g., mAP, recall@K), baseline comparisons, or ablation tables; without these data the central empirical claim cannot be evaluated.
  2. [§3] §3 (Method): The domain-agnostic claim rests on the untested assumption that captions from photo-centric VLMs preserve fine-grained semantics on sketches, paintings, and clip-art; no per-domain caption accuracy, hallucination rate, or fidelity metrics are provided to support this.
  3. [§4] §4: No error analysis or controlled experiment isolates whether retrieval gains survive when caption quality is degraded (e.g., by substituting generic or incorrect captions), leaving the weakest assumption unexamined.
minor comments (2)
  1. [§3.1] Notation for the text encoder and similarity function should be introduced with explicit equations rather than prose descriptions.
  2. [§4.3] The Midjourney dataset description lacks details on query construction and domain composition.
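For readers unfamiliar with the metrics the referee asks for, the following sketch computes precision@K and average precision for a single query from a ranked list of class labels; mAP is the mean of average precision over queries. The exact protocol used in the paper is not stated here, so the normalisation (over the relevant items present in the ranking) and the example labels are assumptions.

```python
def precision_at_k(ranked_labels, query_label, k):
    """Fraction of the top-k retrieved items that share the query's class."""
    if k <= 0:
        return 0.0
    top = ranked_labels[:k]
    return sum(1 for label in top if label == query_label) / k


def average_precision(ranked_labels, query_label):
    """Mean of precision values taken at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: list of (ranked_labels, query_label) pairs, one per query."""
    if not rankings:
        return 0.0
    return sum(average_precision(r, q) for r, q in rankings) / len(rankings)


# Hypothetical ranking for a "bike" query: relevant items at ranks 1 and 3.
ranked = ["bike", "mug", "bike", "chair"]
p2 = precision_at_k(ranked, "bike", k=2)
ap = average_precision(ranked, "bike")
print(p2, round(ap, 4))
```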

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and indicate the revisions planned for the next version of the manuscript.

  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on Office-Home and DomainNet yet supplies no numerical results (e.g., mAP, recall@K), baseline comparisons, or ablation tables; without these data the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including key numerical results. In the revised manuscript we will update the abstract to report the main mAP and Recall@K figures on Office-Home and DomainNet together with the primary baseline comparisons. The full tables already appear in §4; the change is limited to surfacing the headline numbers in the abstract. revision: yes

  2. Referee: [§3] §3 (Method): The domain-agnostic claim rests on the untested assumption that captions from photo-centric VLMs preserve fine-grained semantics on sketches, paintings, and clip-art; no per-domain caption accuracy, hallucination rate, or fidelity metrics are provided to support this.

    Authors: The manuscript presents the domain-agnostic claim through consistent retrieval gains across all domains in the benchmarks. While we do not supply explicit per-domain caption fidelity or hallucination statistics, the end-to-end results serve as indirect validation. To address the request directly we will add, in the revision, a short qualitative caption analysis across domains and any available quantitative fidelity numbers on annotated subsets. revision: partial

  3. Referee: [§4] §4: No error analysis or controlled experiment isolates whether retrieval gains survive when caption quality is degraded (e.g., by substituting generic or incorrect captions), leaving the weakest assumption unexamined.

    Authors: We concur that an explicit ablation on caption quality would be informative. In the revised §4 we will insert a controlled experiment that replaces the generated captions with generic or random text and reports the resulting drop in retrieval metrics, thereby isolating the contribution of caption semantics. revision: yes
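The caption-quality ablation described in the third response could look roughly like the sketch below, which compares top-1 retrieval accuracy with informative captions against the same queries replaced by a generic caption. Token-overlap similarity, the toy gallery, and the generic caption text are assumptions chosen to keep the example self-contained; they are not the authors' experimental setup.

```python
import re


def tokens(caption):
    """Lowercased word set of a caption."""
    return set(re.findall(r"[a-z0-9]+", caption.lower()))


def top1_accuracy(queries, gallery):
    """queries and gallery are lists of (caption, class_label) pairs.

    Similarity is the number of shared tokens (an assumption); on ties,
    max() returns the first gallery item.
    """
    correct = 0
    for query_caption, query_label in queries:
        q = tokens(query_caption)
        best = max(gallery, key=lambda item: len(q & tokens(item[0])))
        correct += int(best[1] == query_label)
    return correct / len(queries)


# Hypothetical gallery and query captions.
gallery = [("a red bicycle near a wall", "bike"), ("a white coffee mug", "mug")]
queries = [("a sketch of a bicycle", "bike"), ("a drawing of a mug", "mug")]

# Degraded condition: every query caption is replaced by the same generic text.
generic = [("an image", label) for _, label in queries]

acc_real = top1_accuracy(queries, gallery)
acc_generic = top1_accuracy(generic, gallery)
print(acc_real, acc_generic)
```

A large gap between the two accuracies would indicate that retrieval depends on caption semantics rather than on incidental properties of the embedding space.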

Circularity Check

0 steps flagged

No circularity; retrieval performance depends on external pre-trained VLMs


The paper introduces Caption-Matching by generating captions from off-the-shelf vision-language models (BLIP-2, per the Figure 1 caption) and then computing text similarity. No equations, fitted parameters, or self-citations appear in the provided text. The central claim does not reduce any quantity to a definition or fit internal to the work; performance is explicitly attributed to the external models' semantic preservation, which is an assumption but not a circular derivation. This matches the default expectation of a self-contained method whose validity rests on external components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain-agnostic property of captions and the semantic fidelity of off-the-shelf VLMs; no free parameters, ad-hoc axioms, or new entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Pre-trained vision-language models produce captions that capture semantic content independent of visual domain style.
    Invoked when the paper states that captions serve as a domain-agnostic intermediate representation.


