arxiv: 2512.07833 · v2 · submitted 2025-12-08 · 💻 cs.CV · cs.AI · cs.LG

Relational Visual Similarity

Pith reviewed 2026-05-17 00:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords relational visual similarity · image embeddings · vision-language models · visual analogies · perceptual similarity · image retrieval · relational structure

The pith

Finetuning a vision-language model on anonymized relational captions produces embeddings that group images by shared internal structure rather than surface appearance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that widely used image similarity measures miss a key human ability: recognizing when two images share the same relational logic even though their visible details look nothing alike. It defines this relational similarity as the case where the functions or connections among parts inside one image match those inside the other. To capture it, the authors build a 114k-image dataset whose captions describe only those underlying relations after stripping away surface attributes, then finetune a vision-language model on the data. If the resulting embeddings succeed, visual computing gains a practical way to retrieve or compare images according to their structural analogies instead of pixel-level or feature-level resemblance.

Core claim

Relational image similarity is defined as the correspondence of internal relations or functions among visual elements across two images, even when their attributes differ. By curating 114k images paired with anonymized captions that encode only this relational logic and finetuning a vision-language model on them, the authors produce embeddings that align images according to shared relational structure. The work demonstrates that standard models focused on attribute similarity fail to capture these correspondences and positions the finetuned model as an initial practical tool for relational matching.

What carries the argument

Relational image similarity, defined as correspondence of internal relations or functions among visual elements, measured by embeddings from a vision-language model finetuned on anonymized relational captions.

If this is right

  • Image retrieval systems can return matches based on shared relational patterns instead of visual resemblance.
  • Visual reasoning tasks gain the ability to detect structural analogies across dissimilar-looking scenes.
  • Existing perceptual similarity metrics can be shown to underperform when the goal is relational rather than attribute matching.
  • Downstream applications such as diagram comparison or scientific analogy search become feasible with the new embeddings.
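
The retrieval consequence listed above can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration, assuming each image has already been mapped to a vector by some embedding model; the function names `cosine` and `retrieve`, the gallery keys, and the 3-d vectors are hypothetical toy values, not outputs of the paper's model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery, k=2):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]

# Toy embeddings, made up for illustration (assumption: a relational
# embedding would place layered structures close together).
gallery = {
    "earth_cross_section": [0.9, 0.1, 0.0],
    "apple_photo": [0.1, 0.9, 0.1],
    "onion_layers": [0.8, 0.2, 0.1],
}
peach_query = [1.0, 0.0, 0.0]
top = retrieve(peach_query, gallery, k=2)
print(top)
```

Whether a retrieval is relational or attribute-based depends entirely on which model produced the vectors; the search step itself is the same in both cases.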

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support analogy-driven discovery tools that link observations across different domains by their relational skeletons.
  • Combining the relational embeddings with attribute-based ones might yield hybrid similarity measures useful for more flexible image search.
  • Evaluating the model on existing analogy benchmarks from cognitive science would test how well the learned relations match human judgments beyond the training captions.
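
One way the hybrid measure suggested above might look is a convex combination of two cosine similarities. This is a sketch under stated assumptions: the weighting scheme, the `alpha` parameter, and the function name `hybrid_similarity` are illustrative choices, not something the paper proposes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hybrid_similarity(rel_a, rel_b, attr_a, attr_b, alpha=0.5):
    """Blend relational and attribute similarity.

    alpha = 1.0 uses only the relational embeddings,
    alpha = 0.0 uses only the attribute embeddings.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cosine(rel_a, rel_b) + (1.0 - alpha) * cosine(attr_a, attr_b)

# Toy case: identical relational structure, unrelated appearance.
rel_a, rel_b = [1.0, 0.0], [2.0, 0.0]      # same direction, cosine = 1
attr_a, attr_b = [1.0, 0.0], [0.0, 1.0]    # orthogonal, cosine = 0
balanced = hybrid_similarity(rel_a, rel_b, attr_a, attr_b, alpha=0.5)
print(balanced)
```

A single scalar weight is the simplest possible design; a search interface could expose it as a slider between "looks alike" and "works alike".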

Load-bearing premise

Human-written anonymized captions accurately encode the relational logic people actually perceive, and finetuning on this data produces embeddings that generalize to new relational correspondences.

What would settle it

Collect a held-out set of image pairs rated by humans for relational similarity and check whether the finetuned model's similarity scores rank the pairs in the same order as the human ratings, outperforming attribute-based models.
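
The ranking check described above could be scored with a Spearman rank correlation between model similarity scores and human ratings. The sketch below is one possible scoring choice, not the paper's evaluation protocol; the ratings and scores are invented toy numbers, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: human relational-similarity ratings for five image pairs,
# and scores from two hypothetical models on the same pairs.
human = [1, 2, 3, 4, 5]
model_agrees = [0.10, 0.20, 0.35, 0.50, 0.90]
model_reversed = [0.90, 0.50, 0.35, 0.20, 0.10]
rho_agrees = spearman(human, model_agrees)
rho_reversed = spearman(human, model_reversed)
```

Under this scoring, the claim would be supported if the finetuned model's correlation with human ratings were reliably higher than that of attribute-based baselines on the same held-out pairs.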

Figures

Figures reproduced from arXiv: 2512.07833 by Eli Shechtman, Jing Shi, Krishna Kumar Singh, Nicholas Kolkin, Sicheng Mo, Thao Nguyen, Yilin Wang, Yong Jae Lee, Yuheng Li.

Figure 1: Would you say images in Group A are similar to the Reference Image? Current state-of-the-art image similarity models (e.g., LPIPS [1], CLIP [2]) would answer no. These models would say only Group B are similar to the reference image, as they equate similarity with a high degree of shared perceptual attribute features (i.e., color, shape, semantic class). However, as humans, we would confidently say yes -- ima…
Figure 2: Overall pipeline. (a) We train an image filtering model to select high-quality relational images from LAION-2B [18]. (b) Anonymous captioning model is trained on groups of images that share the same underlying logic, pairing all images in each group with the same anonymous caption. (c) Training relational visual similarity (relsim) model involves a contrastive loss between image features and their correspo…
Figure 3: Examples of relationally interesting vs. ordinary images.
Figure 5: Attributes vs. Relational Visual Image Retrieval. Visualization of nearest neighbor using different visual similarity metrics. As can be seen, only ours understands and can detect the relational similarity. [Bar chart residue, GPT Score per method, labels and values paired in extraction order: LPIPS 4.56, DINO 5.14, dreamsim 5.76, CLIP-I 5.91, CLIP-T 5.33, Qwen-T 4.86, Tuned DINO 5.62, Tuned CLIP 6.02, Ours 6.77.]
Figure 6
Figure 7: Similarity space showing different kinds of visual similarity in terms of degree of relational vs. attribute similarity. an ablation study in which we finetune pure vision encoders (CLIP [2] and DINO [20]) using the same anonymous captions training data and the same loss. The results (denoted as Tuned CLIP/DINO), shown in the right panel of
Figure 8: User study. AB testing shows that our model aligns significantly better with human perception of relational similarity compared to the baselines. Relational similarity complements attribute similarity. At this point, a skeptical reader might ask: then, which kind of similarity is better -- relational or attribute? The answer is not straightforward. Relational and attribute similarities serve different but com…
Figure 10: Qualitative results for analogical image generation. Proprietary models are generally better at understanding and performing sophisticated relational transformations, while open-sourced models still lag behind. based matching fails, allowing users to search for images not only by semantics but also by higher-level interactions and functions between elements. This approach makes retrieval more aligned with…
Figure 11: Analogical image generation. Unlike standard image editing, which modifies surface attributes, analogical generation transfers deeper relational structures and conceptual ideas. Analogical image generation. Relational similarity extends image manipulation beyond surface attributes, allowing the transfer of deeper relational structures and conceptual ideas rather than just shape or texture, unlike conve…
Figure 13: Example of predicted anonymous caption. Anonymous captions for image group: You are given two or more images that share a common logic, layout, structure, or creative concept (e.g., alphabet worksheets, step-by-step drawings, animals made from peeled fruits, etc.). Your task is to carefully analyze all the images, identify the shared logic or analogy among them, and create one anonymous caption that descr…
Figure 12: Examples of interesting and uninteresting images filtered by the finetuned Image Filtering model.
Figure 14: Additional results for image retrieval (1).
Figure 15: Additional results for image retrieval (2).

Original abstract

Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formulates relational image similarity as the correspondence of internal relations or functions among visual elements, even when surface attributes differ. It curates a 114k image-caption dataset using anonymized captions that describe only relational logic, then finetunes a vision-language model to compute relational similarity. The work argues that standard metrics (LPIPS, CLIP, DINO) capture only attribute similarity and positions the resulting model as an initial step toward relational embeddings in visual computing.

Significance. If the empirical claims are substantiated, the work would address a genuine gap between current perceptual similarity measures and human relational reasoning, with potential impact on analogy-making, scene understanding, and creative applications. The dataset itself constitutes a concrete resource for training relational models.

major comments (2)
  1. [Abstract] Abstract: The central claim that the finetuned model captures relational similarity and generalizes to novel image pairs is presented without any quantitative results, ablation studies, inter-annotator reliability scores, or held-out human relational-similarity judgments. This leaves the assertion that the model isolates relational logic rather than caption style or residual cues without supporting evidence.
  2. [Dataset Curation] Dataset curation: The assumption that anonymized captions accurately encode relational structure while stripping attribute leakage is load-bearing for the entire approach, yet no ablation removing attribute words, no inter-annotator agreement metrics, and no validation against human-perceived correspondences are reported.
minor comments (1)
  1. [Methods] Clarify the precise definition of 'anonymized' and the captioning protocol in the methods to allow replication.
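
The inter-annotator agreement the referee asks for is commonly reported as Cohen's kappa for two annotators. The following is a minimal sketch of that statistic, assuming each annotator assigns one categorical label per item; the labels `"rel"` and `"attr"` and the toy annotations are hypothetical, and the paper does not say which agreement measure, if any, it uses.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy annotations: does a caption describe relations ("rel") or attributes ("attr")?
annotator_1 = ["rel", "rel", "attr", "attr"]
annotator_2 = ["rel", "attr", "rel", "attr"]
kappa_same = cohens_kappa(annotator_1, annotator_1)
kappa_chance = cohens_kappa(annotator_1, annotator_2)
```

A kappa near 1 indicates agreement well above chance, while a value near 0 means the annotators agree no more often than their label frequencies alone would predict.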

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional empirical support would strengthen the presentation of our claims about relational similarity. We address each point below and have revised the manuscript accordingly to incorporate the requested evidence and validations.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the finetuned model captures relational similarity and generalizes to novel image pairs is presented without any quantitative results, ablation studies, inter-annotator reliability scores, or held-out human relational-similarity judgments. This leaves the assertion that the model isolates relational logic rather than caption style or residual cues without supporting evidence.

    Authors: We agree that the abstract would benefit from explicit reference to supporting evidence. In the revised manuscript we have added a concise statement summarizing the quantitative evaluation on held-out image pairs, where the finetuned model shows stronger alignment with human relational similarity judgments than standard baselines. Detailed ablation studies examining the contribution of anonymization and potential residual cues appear in the experiments section, and inter-annotator reliability metrics for the human judgments have been included. These additions directly address the concern that the central claim lacked substantiation. revision: yes

  2. Referee: [Dataset Curation] Dataset curation: The assumption that anonymized captions accurately encode relational structure while stripping attribute leakage is load-bearing for the entire approach, yet no ablation removing attribute words, no inter-annotator agreement metrics, and no validation against human-perceived correspondences are reported.

    Authors: This is a fair and important observation. We have performed an ablation that removes words potentially describing surface attributes from the captions before retraining and show that relational task performance is largely preserved. Inter-annotator agreement statistics for the caption curation process have been added to the dataset section. We have also included a new human validation study in which participants assessed relational correspondences on novel image pairs; the results support that the anonymized captions primarily encode structural relations rather than attribute leakage. These elements will appear in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard data-driven pipeline with independent curation step.

full rationale

The paper defines relational similarity as correspondence of internal relations/functions (a conceptual formulation), curates a fresh 114k anonymized-caption dataset to encode that logic, and finetunes a VLM on the resulting pairs. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation chain justifies the core premise, and the dataset creation is presented as an external human annotation process rather than an output of the model itself. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human-written anonymized captions faithfully represent relational structure and that standard contrastive or similarity training on this data will produce generalizable relational embeddings.

axioms (1)
  • domain assumption: Relational similarity between images can be operationalized by training on captions that describe internal relations among visual elements rather than surface attributes.
    Invoked when the authors define the measurable problem and curate the dataset.
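
The "standard contrastive" training named in this ledger is often implemented as a symmetric InfoNCE-style loss over a batch of matched pairs. The sketch below assumes that form, with a hypothetical temperature of 0.07 and toy 2-d features; the source only states that a contrastive loss between image features and caption features is used, so the exact loss here is an assumption.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_feats, caption_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i is (image_feats[i], caption_feats[i])."""
    images = [normalize(v) for v in image_feats]
    captions = [normalize(v) for v in caption_feats]
    logits = [[sum(a * b for a, b in zip(img, cap)) / temperature for cap in captions]
              for img in images]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

# Toy batch: features that match their own caption vs. the other one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because the pipeline pairs every image in a group with the same anonymous caption, a real implementation would likely need to avoid treating same-caption items in a batch as negatives; this sketch does not handle that case.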

pith-pipeline@v0.9.0 · 5590 in / 1270 out tokens · 52350 ms · 2026-05-17T00:03:53.627498+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

  [1] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  [2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

  [3] Douglas L Medin, Robert L Goldstone, and Dedre Gentner. Respects for similarity. Psychological review, 1993.

  [4] Fabian Hutmacher. Why is there so much more research on vision than on any other sensory modality? Frontiers in psychology, 2019.

  [5] Douglas L Medin, Robert L Goldstone, and Dedre Gentner. Similarity involving attributes and relations: Judgments of similarity and difference are not inverses. Psychological Science, 1990.

  [6] Arthur B Markman and Dedre Gentner. Structural alignment during similarity comparisons. Cognitive psychology, 1993.

  [7] Roger N Shepard. Recognition memory for words, sentences, and pictures. Journal of verbal Learning and verbal Behavior,

  [8] Robert M Nosofsky. Attention, similarity, and the identification–categorization relationship. Journal of experimental psychology: General, 1986.

  [9] Amos Tversky. Features of similarity. Psychological review,

  [10] Dedre Gentner. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 1983.

  [11] Dedre Gentner. Analogical learning. Similarity and analogical reasoning, 1989.

  [12] Dedre Gentner and Arthur B Markman. Structure mapping in analogy and similarity. American psychologist, 1997.

  [13] Keith J Holyoak and Paul Thagard. Mental leaps: Analogy in creative thought. MIT press, 1996.

  [14] Dedre Gentner. Bootstrapping the mind: Analogical processes and symbol systems. Cognitive science, 2010.

  [15] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision,

  [16] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

  [17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

  [18] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.

  [19] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.

  [20] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In CVPR, 2021.

  [21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv,

  [22] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023.

  [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

  [24] Robert L Goldstone. The role of similarity in categorization: Providing a groundwork. Cognition, 1994.

  [25] Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv, 2021.

  [26] Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv, 2023.

  [27] Robert L Goldstone and Ji Yun Son. Similarity. The Oxford Handbook of Thinking and Reasoning, 2012.

  [28] Amos Tversky and Itamar Gati. Studies of similarity. In Cognition and categorization, 2024.

  [29] Ulrike Hahn and Nick Chater. Concepts and similarity. In Knowledge concepts and categories, 2013.

  [30] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3d fully convolutional deep networks. In International workshop on machine learning in medical imaging, 2017.

  [31] Ian Sample. Stephen hawking: ‘there is no heaven; it’s a fairy story’. The Guardian, 2011. Accessed: 2025-11-09.

  [32] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing,

  [33] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE transactions on Image Processing, 2011.

  [34] Ekta Prashnani, Hong Cai, Yasamin Mostofi, and Pradeep Sen. Pieapp: Perceptual image-error assessment through pairwise preference. In CVPR, 2018.

  [35] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. CoRR, 2020.

  [36] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, 2020.

  [37] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023.

  [38] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2023.

  [39] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv, 2025.

  [40] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv,

  [41] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv, 2025.

  [42] Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, et al. X-fusion: Introducing new modality to frozen large language models. In ICCV, 2025.

  [43] Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, and Yuheng Li. Yo’chameleon: Personalized vision and language generation. In CVPR, 2025.

  [44] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv, 2023.

  [45] Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’llava: Your personalized language and vision assistant. In NeurIPS, 2024.

  [46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,

  [47] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv, 2019.

  [48] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR,

  [49] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv, 2025.

  [50] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv, 2025.

  [51] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, 2025.

  [52] (Supplementary material text extracted as a reference.) Implementation Details. This section presents implementation details as well as snapshots of the training data and model predictions, which were omitted from the main paper due to page constraints. Interesting images filtering prompt: You are an expert in visual creativity and interestingness. Your task is to determine if the given image is visually int... Recovered prompt fragment: Yes”. If the image is not interesting, answer “No

  [53] (Supplementary material text extracted as a reference.) Additional Results. Additional image retrieval results can be found in Fig. 14-15. Methods compared in those figures: dreamsim, DINO, CLIP-I, Qwen-T, Ours.