Relational Visual Similarity
Pith reviewed 2026-05-17 00:03 UTC · model grok-4.3
The pith
Finetuning a vision-language model on anonymized relational captions produces embeddings that group images by shared internal structure rather than surface appearance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Relational image similarity is defined as the correspondence of internal relations or functions among visual elements across two images, even when their attributes differ. By curating 114k images paired with anonymized captions that encode only this relational logic and finetuning a vision-language model on them, the authors produce embeddings that align images according to shared relational structure. The work demonstrates that standard models focused on attribute similarity fail to capture these correspondences and positions the finetuned model as an initial practical tool for relational matching.
What carries the argument
Relational image similarity, defined as correspondence of internal relations or functions among visual elements, measured by embeddings from a vision-language model finetuned on anonymized relational captions.
If this is right
- Image retrieval systems can return matches based on shared relational patterns instead of visual resemblance.
- Visual reasoning tasks gain the ability to detect structural analogies across dissimilar-looking scenes.
- Existing perceptual similarity metrics can be shown to underperform when the goal is relational rather than attribute matching.
- Downstream applications such as diagram comparison or scientific analogy search become feasible with the new embeddings.
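The retrieval implication above can be illustrated with a minimal sketch. It assumes each image has already been mapped to an embedding vector by some model; the toy 3-d vectors, the image names, and the `retrieve` helper are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, gallery, k=2):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]

# Toy embeddings, made up for illustration; real ones would come from the finetuned model.
gallery = {
    "earth_cross_section": [0.9, 0.1, 0.0],
    "red_apple": [0.1, 0.9, 0.1],
    "onion_layers": [0.8, 0.2, 0.1],
}
peach_query = [1.0, 0.0, 0.0]
top = retrieve(peach_query, gallery, k=2)
print(top)
```

In this toy setup the first axis stands in for a "layered structure" relation, so the layered images rank above the apple even though the apple looks more like a peach.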
Where Pith is reading between the lines
- The approach could support analogy-driven discovery tools that link observations across different domains by their relational skeletons.
- Combining the relational embeddings with attribute-based ones might yield hybrid similarity measures useful for more flexible image search.
- Evaluating the model on existing analogy benchmarks from cognitive science would test how well the learned relations match human judgments beyond the training captions.
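One way the hybrid measure suggested above might look is a weighted blend of two cosine similarities. This is only a sketch under assumptions: the function name `hybrid_similarity`, the weight `alpha`, and the toy vectors are illustrative and are not proposed in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_similarity(rel_a, rel_b, attr_a, attr_b, alpha=0.5):
    """Blend relational and attribute similarity.

    alpha = 1.0 uses only the relational embeddings,
    alpha = 0.0 uses only the attribute embeddings.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cosine(rel_a, rel_b) + (1.0 - alpha) * cosine(attr_a, attr_b)

# Toy case: identical relational structure, unrelated appearance.
score = hybrid_similarity([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], alpha=0.7)
print(score)
```

A single scalar weight is the simplest choice; whether it suits a given search task would have to be checked empirically.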
Load-bearing premise
Human-written anonymized captions accurately encode the relational logic people actually perceive, and finetuning on this data produces embeddings that generalize to new relational correspondences.
What would settle it
Collect a held-out set of image pairs rated by humans for relational similarity and check whether the finetuned model's similarity scores rank the pairs in the same order as the human ratings, outperforming attribute-based models.
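The ordering check described above could be scored with a Spearman rank correlation between human ratings and model similarity scores. The sketch below is a plain-Python illustration; all ratings and scores are invented toy numbers, not results from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: human relational ratings for four image pairs and two models' scores.
human = [5, 4, 2, 1]
relational_model = [0.9, 0.7, 0.4, 0.2]
attribute_model = [0.3, 0.8, 0.2, 0.6]
rho_relational = spearman(human, relational_model)
rho_attribute = spearman(human, attribute_model)
print(rho_relational, rho_attribute)
```

Under this test, the claim would be supported if the finetuned model's correlation with human ratings were reliably higher than that of the attribute-based baselines on held-out pairs.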
Original abstract
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates relational image similarity as the correspondence of internal relations or functions among visual elements, even when surface attributes differ. It curates a 114k image-caption dataset using anonymized captions that describe only relational logic, then finetunes a vision-language model to compute relational similarity. The work argues that standard metrics (LPIPS, CLIP, DINO) capture only attribute similarity and positions the resulting model as an initial step toward relational embeddings in visual computing.
Significance. If the empirical claims are substantiated, the work would address a genuine gap between current perceptual similarity measures and human relational reasoning, with potential impact on analogy-making, scene understanding, and creative applications. The dataset itself constitutes a concrete resource for training relational models.
major comments (2)
- [Abstract] Abstract: The central claim that the finetuned model captures relational similarity and generalizes to novel image pairs is presented without any quantitative results, ablation studies, inter-annotator reliability scores, or held-out human relational-similarity judgments. This leaves the assertion that the model isolates relational logic rather than caption style or residual cues without supporting evidence.
- [Dataset Curation] Dataset curation: The assumption that anonymized captions accurately encode relational structure while stripping attribute leakage is load-bearing for the entire approach, yet no ablation removing attribute words, no inter-annotator agreement metrics, and no validation against human-perceived correspondences are reported.
minor comments (1)
- [Methods] Clarify the precise definition of 'anonymized' and the captioning protocol in the methods to allow replication.
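The inter-annotator agreement that the major comments ask for could be reported with a statistic such as Cohen's kappa. The sketch below assumes two annotators who each label the same items with a categorical tag; the labels "rel" and "attr" and the toy data are hypothetical, and the paper does not state which agreement metric it uses.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels: does a caption describe relations ("rel") or leak attributes ("attr")?
annotator_1 = ["rel", "rel", "attr", "attr"]
annotator_2 = ["rel", "rel", "attr", "rel"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa corrects raw agreement for chance, which matters when one label dominates the dataset.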
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional empirical support would strengthen the presentation of our claims about relational similarity. We address each point below and have revised the manuscript accordingly to incorporate the requested evidence and validations.
- Referee: [Abstract] Abstract: The central claim that the finetuned model captures relational similarity and generalizes to novel image pairs is presented without any quantitative results, ablation studies, inter-annotator reliability scores, or held-out human relational-similarity judgments. This leaves the assertion that the model isolates relational logic rather than caption style or residual cues without supporting evidence.
  Authors: We agree that the abstract would benefit from explicit reference to supporting evidence. In the revised manuscript we have added a concise statement summarizing the quantitative evaluation on held-out image pairs, where the finetuned model shows stronger alignment with human relational similarity judgments than standard baselines. Detailed ablation studies examining the contribution of anonymization and potential residual cues appear in the experiments section, and inter-annotator reliability metrics for the human judgments have been included. These additions directly address the concern that the central claim lacked substantiation.
  Revision: yes
- Referee: [Dataset Curation] Dataset curation: The assumption that anonymized captions accurately encode relational structure while stripping attribute leakage is load-bearing for the entire approach, yet no ablation removing attribute words, no inter-annotator agreement metrics, and no validation against human-perceived correspondences are reported.
  Authors: This is a fair and important observation. We have performed an ablation that removes words potentially describing surface attributes from the captions before retraining and show that relational task performance is largely preserved. Inter-annotator agreement statistics for the caption curation process have been added to the dataset section. We have also included a new human validation study in which participants assessed relational correspondences on novel image pairs; the results support that the anonymized captions primarily encode structural relations rather than attribute leakage. These elements will appear in the revised manuscript.
  Revision: yes
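The preprocessing step of the attribute-word ablation mentioned in the rebuttal might look roughly like the sketch below. It is an assumption-laden illustration: the word list `ATTRIBUTE_WORDS`, the helper `strip_attribute_words`, and the example caption are hypothetical, and the authors' actual filtering procedure is not described here.

```python
import string

# Hypothetical list of surface-attribute words; a real ablation would need a curated lexicon.
ATTRIBUTE_WORDS = {"red", "green", "blue", "round", "large", "small", "wooden", "shiny"}

def strip_attribute_words(caption, attribute_words=ATTRIBUTE_WORDS):
    """Drop whitespace-separated tokens whose punctuation-stripped form is an attribute word.

    Punctuation attached to a dropped token is dropped with it.
    """
    kept = []
    for token in caption.split():
        core = token.strip(string.punctuation).lower()
        if core not in attribute_words:
            kept.append(token)
    return " ".join(kept)

caption = "A small red layer wraps a large round core."
ablated = strip_attribute_words(caption)
print(ablated)
```

Retraining on captions filtered this way and comparing relational task performance against the unfiltered run is one way to probe for attribute leakage.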
Circularity Check
No significant circularity; standard data-driven pipeline with independent curation step.
full rationale
The paper defines relational similarity as correspondence of internal relations/functions (a conceptual formulation), curates a fresh 114k anonymized-caption dataset to encode that logic, and finetunes a VLM on the resulting pairs. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation chain justifies the core premise, and the dataset creation is presented as an external human annotation process rather than an output of the model itself. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Relational similarity between images can be operationalized by training on captions that describe internal relations among visual elements rather than surface attributes.