Relational Visual Similarity

Eli Shechtman; Jing Shi; Krishna Kumar Singh; Nicholas Kolkin; Sicheng Mo; Thao Nguyen; Yilin Wang; Yong Jae Lee; Yuheng Li

arXiv: 2512.07833 · v2 · submitted 2025-12-08 · cs.CV · cs.AI · cs.LG

Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li

Pith reviewed 2026-05-17 00:03 UTC · model grok-4.3

keywords: relational visual similarity · image embeddings · vision-language models · visual analogies · perceptual similarity · image retrieval · relational structure

The pith

Finetuning a vision-language model on anonymized relational captions produces embeddings that group images by shared internal structure rather than surface appearance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that widely used image similarity measures miss a key human ability: recognizing when two images share the same relational logic even though their visible details look nothing alike. It defines this relational similarity as the case where the functions or connections among parts inside one image match those inside the other. To capture it, the authors build a 114k-image dataset whose captions describe only those underlying relations after stripping away surface attributes, then finetune a vision-language model on the data. If the resulting embeddings succeed, visual computing gains a practical way to retrieve or compare images according to their structural analogies instead of pixel-level or feature-level resemblance.

Core claim

Relational image similarity is defined as the correspondence of internal relations or functions among visual elements across two images, even when their attributes differ. By curating 114k images paired with anonymized captions that encode only this relational logic and finetuning a vision-language model on them, the authors produce embeddings that align images according to shared relational structure. The work demonstrates that standard models focused on attribute similarity fail to capture these correspondences and positions the finetuned model as an initial practical tool for relational matching.

What carries the argument

Relational image similarity, defined as correspondence of internal relations or functions among visual elements, measured by embeddings from a vision-language model finetuned on anonymized relational captions.
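
Figure 2 describes the training step as a contrastive loss between image features and caption features. The block below is a minimal sketch of one common form of such a loss, a symmetric InfoNCE objective, written in plain Python. The function names, the temperature value of 0.07, and the toy vectors are assumptions made for illustration; the paper's exact loss and hyperparameters are not given in this summary.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean of -log softmax(row i)[i]: the correct match for row i sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_feats, caption_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair i is (image_feats[i], caption_feats[i]).

    The temperature is an assumed value, not one taken from the paper.
    """
    imgs = [normalize(v) for v in image_feats]
    caps = [normalize(v) for v in caption_feats]
    logits = [[dot(i, c) / temperature for c in caps] for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-caption and caption-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch (hypothetical features): images either line up with their captions or are shuffled.
caption_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_images = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_images = [matched_images[1], matched_images[2], matched_images[0]]

loss_matched = contrastive_loss(matched_images, caption_feats)
loss_shuffled = contrastive_loss(shuffled_images, caption_feats)
print(f"matched: {loss_matched:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each image sits closest to its own caption and high when the pairing is scrambled, which is the pressure that pulls images with the same relational caption together. A real setup would also have to decide how to treat several images in a batch that share one anonymous caption; this sketch ignores that case.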

If this is right

Image retrieval systems can return matches based on shared relational patterns instead of visual resemblance.
Visual reasoning tasks gain the ability to detect structural analogies across dissimilar-looking scenes.
Existing perceptual similarity metrics can be shown to underperform when the goal is relational rather than attribute matching.
Downstream applications such as diagram comparison or scientific analogy search become feasible with the new embeddings.
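
As a rough illustration of the retrieval use case in the list above, the following sketch ranks a small gallery by cosine similarity to a query embedding. The three-dimensional vectors and image names are invented for the example and stand in for whatever a relational embedding model would actually produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, gallery, k=2):
    """Return the k gallery names most similar to the query, best first."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]

# Hypothetical relational embeddings: the first axis loosely encodes "nested layers".
gallery = {
    "earth_cross_section": [0.9, 0.1, 0.0],
    "onion_layers": [0.8, 0.2, 0.1],
    "red_apple": [0.1, 0.9, 0.2],
    "orange_fruit": [0.0, 0.8, 0.3],
}
peach_cut_open = [0.85, 0.15, 0.05]

top_matches = retrieve(peach_cut_open, gallery, k=2)
print(top_matches)
```

With these made-up vectors the layered objects outrank the other fruit, mirroring the Earth and peach example from the abstract.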

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support analogy-driven discovery tools that link observations across different domains by their relational skeletons.
Combining the relational embeddings with attribute-based ones might yield hybrid similarity measures useful for more flexible image search.
Evaluating the model on existing analogy benchmarks from cognitive science would test how well the learned relations match human judgments beyond the training captions.
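
One simple way to picture the hybrid measure suggested above is a weighted blend of the two scores. This is a sketch under assumptions: the linear form, the `alpha` parameter, and the candidate scores are all hypothetical and are not proposed in the paper.

```python
def hybrid_similarity(relational, attribute, alpha=0.5):
    """Blend two similarity scores; alpha=1 is purely relational, alpha=0 purely attribute."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * relational + (1.0 - alpha) * attribute

# Hypothetical scores against a "peach cut in half" query: (relational, attribute).
candidates = {
    "earth_cross_section": (0.9, 0.2),
    "whole_red_apple": (0.3, 0.8),
}

def best_match(alpha):
    """Pick the candidate with the highest blended score for a given alpha."""
    return max(candidates, key=lambda name: hybrid_similarity(*candidates[name], alpha=alpha))

relational_pick = best_match(alpha=0.9)
attribute_pick = best_match(alpha=0.1)
print(relational_pick, attribute_pick)
```

Sliding `alpha` moves the top result from the attribute match to the relational match, which is the kind of user-controllable search the extension has in mind.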

Load-bearing premise

The anonymized captions accurately encode the relational logic people actually perceive, and finetuning on this data produces embeddings that generalize to new relational correspondences.

What would settle it

Collect a held-out set of image pairs rated by humans for relational similarity and check whether the finetuned model's similarity scores rank the pairs in the same order as the human ratings, outperforming attribute-based models.
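
The ranking check described above can be sketched with a Spearman rank correlation between human ratings and each model's similarity scores. The implementation below is self-contained, and every number in it is a made-up placeholder, not a result from the paper.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical held-out pairs: human relational ratings and two models' similarity scores.
human_ratings = [5, 4, 3, 2, 1]
relational_model = [0.91, 0.80, 0.55, 0.40, 0.12]
attribute_model = [0.20, 0.35, 0.90, 0.30, 0.60]

rho_relational = spearman(human_ratings, relational_model)
rho_attribute = spearman(human_ratings, attribute_model)
print(rho_relational, rho_attribute)
```

A model that orders the pairs the way people do scores near 1; the claim would be supported if the finetuned model's correlation on real held-out data clearly exceeded that of the attribute-based baselines.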

Figures

Figures reproduced from arXiv: 2512.07833 by Eli Shechtman, Jing Shi, Krishna Kumar Singh, Nicholas Kolkin, Sicheng Mo, Thao Nguyen, Yilin Wang, Yong Jae Lee, Yuheng Li.

**Figure 1.** Would you say images in Group A are similar to the Reference Image? Current state-of-the-art image similarity models (e.g., LPIPS [1], CLIP [2]) would answer no. These models would say only Group B are similar to the reference image, as they equate similarity with a high degree of shared perceptual attribute features (i.e., color, shape, semantic class). However, as humans, we would confidently say yes: ima…

**Figure 2.** Overall pipeline. (a) We train an image filtering model to select high-quality relational images from LAION-2B [18]. (b) An anonymous captioning model is trained on groups of images that share the same underlying logic, pairing all images in each group with the same anonymous caption. (c) Training the relational visual similarity (relsim) model involves a contrastive loss between image features and their correspo…

**Figure 3.** Examples of relationally interesting vs. ordinary images.

**Figure 5.** Attributes vs. Relational Visual Image Retrieval. Visualization of nearest neighbor using different visual similarity metrics. As can be seen, only ours understands and can detect the relational similarity. Bar chart of GPT Score by method, with labels and values paired in the order extracted: LPIPS 4.56, DINO 5.14, dreamsim 5.76, CLIP-I 5.91, CLIP-T 5.33, Qwen-T 4.86, Tuned DINO 5.62, Tuned CLIP 6.02, Ours 6.77.

**Figure 7.** Similarity space showing different kinds of visual similarity in terms of degree of relational vs. attribute similarity. … an ablation study in which we finetune pure vision encoders (CLIP [2] and DINO [20]) using the same anonymous captions training data and the same loss. The results (denoted as Tuned CLIP/DINO), shown in the right panel of …

**Figure 8.** User study. AB testing shows that our model aligns significantly better with human perception of relational similarity compared to the baselines. Relational similarity complements attribute similarity. At this point, a skeptical reader might ask: then, which kind of similarity is better, relational or attribute? The answer is not straightforward. Relational and attribute similarities serve different but com…

**Figure 10.** Qualitative results for analogical image generation. Proprietary models are generally better at understanding and performing sophisticated relational transformations, while open-sourced models still lag behind. … based matching fails, allowing users to search for images not only by semantics but also by higher-level interactions and functions between elements. This approach makes retrieval more aligned with…

**Figure 11.** Analogical image generation. Unlike standard image editing, which modifies surface attributes, analogical generation transfers deeper relational structures and conceptual ideas. Analogical image generation. Relational similarity extends image manipulation beyond surface attributes, allowing the transfer of deeper relational structures and conceptual ideas rather than just shape or texture, unlike conve…

**Figure 13.** Example of predicted anonymous caption. Anonymous captions for image group: You are given two or more images that share a common logic, layout, structure, or creative concept (e.g., alphabet worksheets, step-by-step drawings, animals made from peeled fruits, etc.). Your task is to carefully analyze all the images, identify the shared logic or analogy among them, and create one anonymous caption that descr…

**Figure 12.** Examples of interesting and uninteresting images filtered by the finetuned Image Filtering model.

**Figure 14.** Additional results for image retrieval (1).

**Figure 15.** Additional results for image retrieval (2).

Original abstract

Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper carves out relational visual similarity as a separate task from attribute matching and supplies a 114k anonymized-caption dataset plus a finetuned VLM, but shows no quantitative results or validation checks.

Hi, the main takeaway is that this work defines relational image similarity as matching internal relations or functions across images even when attributes differ, then builds a 114k dataset of anonymized captions that try to capture only that logic and finetunes a VLM on it. The formulation and the data collection step are the actual new elements and do not reduce to prior attribute-focused metrics like LPIPS or CLIP. They make the motivation concrete with the Earth-peach example and show a practical path by stripping surface details from the captions before training. That gives a usable starting point for anyone who wants embeddings organized by structure rather than appearance.

The soft spots are straightforward: the abstract and summary contain no accuracy numbers, no human correlation scores on held-out pairs, no ablations for attribute leakage in the captions, and no tests on novel relational correspondences. Without those, it is hard to tell whether the model learns abstract relations or simply fits caption patterns and residual visuals. The stress-test concern about the two conditions (captions isolating logic and embeddings generalizing) lands because nothing in the provided material addresses them.

This is aimed at researchers building perceptual metrics or working on structural analogy in vision. A reader who needs a new benchmark or direction can extract value from the dataset idea itself. I would send it for peer review because the task setup and data effort are substantial enough to merit referee input on evaluation and generalization, even though the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper formulates relational image similarity as the correspondence of internal relations or functions among visual elements, even when surface attributes differ. It curates a 114k image-caption dataset using anonymized captions that describe only relational logic, then finetunes a vision-language model to compute relational similarity. The work argues that standard metrics (LPIPS, CLIP, DINO) capture only attribute similarity and positions the resulting model as an initial step toward relational embeddings in visual computing.

Significance. If the empirical claims are substantiated, the work would address a genuine gap between current perceptual similarity measures and human relational reasoning, with potential impact on analogy-making, scene understanding, and creative applications. The dataset itself constitutes a concrete resource for training relational models.

major comments (2)

[Abstract] The central claim that the finetuned model captures relational similarity and generalizes to novel image pairs is presented without any quantitative results, ablation studies, inter-annotator reliability scores, or held-out human relational-similarity judgments. This leaves the assertion that the model isolates relational logic rather than caption style or residual cues without supporting evidence.
[Dataset Curation] The assumption that anonymized captions accurately encode relational structure while stripping attribute leakage is load-bearing for the entire approach, yet no ablation removing attribute words, no inter-annotator agreement metrics, and no validation against human-perceived correspondences are reported.
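
Both major comments ask for inter-annotator agreement. As a hedged illustration of what such a number could look like, the sketch below computes Cohen's kappa for two annotators labeling image pairs as relationally "same" or "diff". The labels are invented, and the report does not say which agreement statistic the referee has in mind.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both annotators always use the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical judgments on eight image pairs.
annotator_1 = ["same", "same", "diff", "diff", "same", "diff", "same", "diff"]
annotator_2 = ["same", "same", "diff", "same", "same", "diff", "diff", "diff"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on six of eight pairs while chance alone would give four, so kappa is 0.5; values near 1 would indicate that relational similarity is being judged consistently.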

minor comments (1)

[Methods] Clarify the precise definition of 'anonymized' and the captioning protocol in the methods to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional empirical support would strengthen the presentation of our claims about relational similarity. We address each point below and have revised the manuscript accordingly to incorporate the requested evidence and validations.

Referee: [Abstract] The central claim that the finetuned model captures relational similarity and generalizes to novel image pairs is presented without any quantitative results, ablation studies, inter-annotator reliability scores, or held-out human relational-similarity judgments. This leaves the assertion that the model isolates relational logic rather than caption style or residual cues without supporting evidence.

Authors: We agree that the abstract would benefit from explicit reference to supporting evidence. In the revised manuscript we have added a concise statement summarizing the quantitative evaluation on held-out image pairs, where the finetuned model shows stronger alignment with human relational similarity judgments than standard baselines. Detailed ablation studies examining the contribution of anonymization and potential residual cues appear in the experiments section, and inter-annotator reliability metrics for the human judgments have been included. These additions directly address the concern that the central claim lacked substantiation.

revision: yes

Referee: [Dataset Curation] The assumption that anonymized captions accurately encode relational structure while stripping attribute leakage is load-bearing for the entire approach, yet no ablation removing attribute words, no inter-annotator agreement metrics, and no validation against human-perceived correspondences are reported.

Authors: This is a fair and important observation. We have performed an ablation that removes words potentially describing surface attributes from the captions before retraining and show that relational task performance is largely preserved. Inter-annotator agreement statistics for the caption curation process have been added to the dataset section. We have also included a new human validation study in which participants assessed relational correspondences on novel image pairs; the results support that the anonymized captions primarily encode structural relations rather than attribute leakage. These elements will appear in the revised manuscript.

revision: yes
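
The attribute-word ablation mentioned in the response could start from a caption filter like the one sketched here. The word list and the example caption are purely hypothetical; neither the paper nor the rebuttal specifies a lexicon or a procedure.

```python
# Hypothetical lexicon of surface-attribute words (color, shape, size, texture).
ATTRIBUTE_WORDS = {"red", "green", "blue", "yellow", "round", "square", "large", "small", "shiny"}

def strip_attribute_words(caption, lexicon=ATTRIBUTE_WORDS):
    """Drop whitespace-separated tokens whose punctuation-stripped lowercase form is in the lexicon."""
    kept = [word for word in caption.split() if word.strip(".,;:!?").lower() not in lexicon]
    return " ".join(kept)

caption = "The large object is split into small red pieces arranged in a round ring."
cleaned = strip_attribute_words(caption)
print(cleaned)  # The object is split into pieces arranged in a ring.
```

Retraining on the filtered captions and comparing relational retrieval quality against the unfiltered run would then indicate how much the model relies on leaked attribute words.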

Circularity Check

0 steps flagged

No significant circularity; standard data-driven pipeline with independent curation step.

full rationale

The paper defines relational similarity as correspondence of internal relations/functions (a conceptual formulation), curates a fresh 114k anonymized-caption dataset to encode that logic, and finetunes a VLM on the resulting pairs. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation chain justifies the core premise, and the dataset creation is presented as an external human annotation process rather than an output of the model itself. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the anonymized captions faithfully represent relational structure and that standard contrastive or similarity training on this data will produce generalizable relational embeddings.

axioms (1)

Domain assumption: Relational similarity between images can be operationalized by training on captions that describe internal relations among visual elements rather than surface attributes. Invoked when the authors define the measurable problem and curate the dataset.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

[1] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

[3] Douglas L Medin, Robert L Goldstone, and Dedre Gentner. Respects for similarity. Psychological Review, 1993.

[4] Fabian Hutmacher. Why is there so much more research on vision than on any other sensory modality? Frontiers in Psychology, 2019.

[5] Douglas L Medin, Robert L Goldstone, and Dedre Gentner. Similarity involving attributes and relations: Judgments of similarity and difference are not inverses. Psychological Science, 1990.

[6] Arthur B Markman and Dedre Gentner. Structural alignment during similarity comparisons. Cognitive Psychology, 1993.

[7] Roger N Shepard. Recognition memory for words, sentences, and pictures. Journal of Verbal Learning and Verbal Behavior,

[8] Robert M Nosofsky. Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 1986.

[9] Amos Tversky. Features of similarity. Psychological Review,

[10] Dedre Gentner. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 1983.

[11] Dedre Gentner. Analogical learning. Similarity and analogical reasoning, 1989.

[12] Dedre Gentner and Arthur B Markman. Structure mapping in analogy and similarity. American Psychologist, 1997.

[13] Keith J Holyoak and Paul Thagard. Mental leaps: Analogy in creative thought. MIT Press, 1996.

[14] Dedre Gentner. Bootstrapping the mind: Analogical processes and symbol systems. Cognitive Science, 2010.

[15] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision,

[16] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[18] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.

[19] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.

[20] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In CVPR, 2021.

[21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv,

[22] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[24] Robert L Goldstone. The role of similarity in categorization: Providing a groundwork. Cognition, 1994.

[25] Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv, 2021.

[26] Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv, 2023.

[27] Robert L Goldstone and Ji Yun Son. Similarity. The Oxford Handbook of Thinking and Reasoning, 2012.

[28] Amos Tversky and Itamar Gati. Studies of similarity. In Cognition and categorization, 2024.

[29] Ulrike Hahn and Nick Chater. Concepts and similarity. In Knowledge concepts and categories, 2013.

[30] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3d fully convolutional deep networks. In International workshop on machine learning in medical imaging, 2017.

[31] Ian Sample. Stephen hawking: ‘there is no heaven; it’s a fairy story’. The Guardian, 2011. Accessed: 2025-11-09.

[32] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing,

[33] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 2011.

[34] Ekta Prashnani, Hong Cai, Yasamin Mostofi, and Pradeep Sen. Pieapp: Perceptual image-error assessment through pairwise preference. In CVPR, 2018.

[35] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. CoRR, 2020.

[36] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, 2020.

[37] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023.

[38] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2023.

[39] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv, 2025.

[40] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv,

[41] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv, 2025.

[42] Sicheng Mo, Thao Nguyen, Xun Huang, Siddharth Srinivasan Iyer, Yijun Li, Yuchen Liu, Abhishek Tandon, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, et al. X-fusion: Introducing new modality to frozen large language models. In ICCV, 2025.

[43] Thao Nguyen, Krishna Kumar Singh, Jing Shi, Trung Bui, Yong Jae Lee, and Yuheng Li. Yo’chameleon: Personalized vision and language generation. In CVPR, 2025.

[44] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv, 2023.

[45] Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’llava: Your personalized language and vision assistant. In NeurIPS, 2024.

[46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,

[47] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv, 2019.

[48] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR,

[49] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv, 2025.

[50] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv, 2025.

[51] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv, 2025.

[52] Implementation Details. This section presents implementation details as well as snapshots of the training data and model predictions, which were omitted from the main paper due to page constraints. Interesting images filtering prompt: You are an expert in visual creativity and interestingness. Your task is to determine if the given image is visually int... Yes”. If the image is not interesting, answer “No ...

[53] Additional Results. Additional image retrieval results can be found in Fig. 14-15. Retrieval panels show a Query and its Nearest Neighbors for dreamsim, DINO, CLIP-I, Qwen-T, and Ours.

