Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations
Pith reviewed 2026-05-15 13:50 UTC · model grok-4.3
The pith
Transformer framework with joint global-local alignment improves retrieval of skin cancer cases from image-text queries
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A transformer framework learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images, where local alignment aggregates discriminative regions via multiple spatial attention masks and global alignment provides holistic semantic supervision, with final similarity computed through convex domain-informed weighting that emphasizes clinically salient local evidence.
What carries the argument
Joint global-local alignment using transformer-based hierarchical representations, multiple spatial attention masks for local feature aggregation, and convex domain-informed weighting for similarity scores
Load-bearing premise
The joint global-local alignment and domain-informed weighting will produce clinically meaningful gains that generalize beyond the Derm7pt dataset and its specific imaging conditions.
What would settle it
Evaluating the model on an independent skin lesion dataset collected under different imaging protocols or patient demographics and finding no improvement or a drop in retrieval accuracy relative to strong baselines would falsify the central claim.
Figures
read the original abstract
Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image to text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a transformer-based framework for composed vision-language retrieval in skin cancer case search. Each query is an image-text pair (reference lesion image plus dermoscopic textual descriptors), and the database consists of biopsy-confirmed multi-class cases. The method learns hierarchical composed query representations, performs joint global-local alignment (multiple spatial attention masks for local discriminative regions plus global semantic supervision), and computes similarity via a convex domain-informed weighting scheme. Experiments on the public Derm7pt dataset are reported to yield consistent improvements over state-of-the-art methods.
Significance. If the experimental claims hold under proper statistical validation, the work could meaningfully advance medical image retrieval by better integrating visual and textual cues for clinically relevant lesion matching. The joint alignment and domain-informed weighting address a practical need in dermatology for retrieving similar cases to support diagnosis, education, and quality control.
major comments (1)
- Experiments section: the reported improvements over SOTA on Derm7pt rest on single-run point estimates for mAP and Recall@K without error bars, multiple random seeds, or paired statistical significance tests. Because the framework involves a hierarchical transformer, multiple attention masks, and learned convex weighting, run-to-run variability could easily account for the observed deltas; this absence directly weakens the central claim that the joint alignment mechanisms produce reliable gains.
minor comments (2)
- The method description would benefit from an explicit equation or pseudocode block showing how the convex domain-informed weights are computed and combined with the global and local similarity scores.
- Clarify the exact train/validation/test splits used on Derm7pt and whether any domain-specific preprocessing (e.g., lesion segmentation) is applied before the transformer.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that stronger statistical validation is needed to support the experimental claims and will incorporate the suggested changes in the revised manuscript.
read point-by-point responses
-
Referee: Experiments section: the reported improvements over SOTA on Derm7pt rest on single-run point estimates for mAP and Recall@K without error bars, multiple random seeds, or paired statistical significance tests. Because the framework involves a hierarchical transformer, multiple attention masks, and learned convex weighting, run-to-run variability could easily account for the observed deltas; this absence directly weakens the central claim that the joint alignment mechanisms produce reliable gains.
Authors: We acknowledge the validity of this concern. The current results are indeed based on single-run evaluations. In the revised manuscript, we will re-run all experiments using at least five different random seeds, report mean and standard deviation for mAP and Recall@K metrics, and include paired statistical significance tests (e.g., Wilcoxon signed-rank test) comparing our method against each baseline. These additions will quantify variability and confirm that the observed improvements from the hierarchical query representations and joint global-local alignment are statistically reliable. revision: yes
Circularity Check
No circularity: framework is architectural description without equations or self-referential reductions
full rationale
The paper describes a transformer-based composed retrieval framework using hierarchical query representations, multiple spatial attention masks for local alignment, global semantic supervision, and convex domain-informed weighting. No equations, derivations, or parameter-fitting steps are presented that could reduce to self-definition, fitted-input predictions, or self-citation chains. The central claim rests on empirical comparisons to SOTA methods on the Derm7pt dataset, which are independent of any internal loop in the method description. This matches the reader's assessment that no equations or derivations are shown and the framework does not reduce to circular definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. ... S=β S_local + (1−β)S_global where β∈[0,1]
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use a hierarchical vision backbone built on the Swin Transformer to extract multi-level visual features
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Dermoscopy image analysis: overview and future directions,
M Emre Celebi, Noel Codella, and Allan Halpern, “Dermoscopy image analysis: overview and future directions,”IEEE journal of biomedical and health informatics, vol. 23, no. 2, pp. 474–478, 2019
work page 2019
-
[2]
Multimodal image dataset for ai- based skin cancer (midas) benchmarking,
Albert S Chiou, Jesutofunmi A Omiye, Haiwen Gui, Susan M Swetter, Justin M Ko, Brian Gastman, Joshua Arbesman, Zhuo Ran Cai, Olivier Gevaert, Christoph Sad ´ee, et al., “Multimodal image dataset for ai- based skin cancer (midas) benchmarking,”NEJM AI, vol. 2, no. 6, pp. AIdbp2400732, 2025. W ANGet al. 5
work page 2025
-
[3]
Dermatologist-level classification of skin cancer with deep neural networks,
Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,”nature, vol. 542, no. 7639, pp. 115–118, 2017
work page 2017
-
[4]
A reinforcement learning model for ai-based decision support in skin cancer,
Catarina Barata, Veronica Rotemberg, Noel CF Codella, Philipp Tschandl, Christoph Rinner, Bengu Nisa Akay, Zoe Apalla, Giuseppe Argenziano, Allan Halpern, Aimilios Lallas, et al., “A reinforcement learning model for ai-based decision support in skin cancer,”Nature Medicine, vol. 29, no. 8, pp. 1941–1946, 2023
work page 1941
-
[5]
A multimodal vision foundation model for clinical dermatology,
Siyuan Yan, Zhen Yu, Clare Primiero, Cristina Vico-Alonso, Zhonghua Wang, Litao Yang, Philipp Tschandl, Ming Hu, Lie Ju, Gin Tan, et al., “A multimodal vision foundation model for clinical dermatology,”Nature Medicine, pp. 1–12, 2025
work page 2025
-
[6]
Neural memory self-supervised state space models with learnable gates,
Zhihua Wang, Yuxin He, Zhang Yi, Tao He, and Jiajun Bu, “Neural memory self-supervised state space models with learnable gates,”IEEE Signal Processing Letters, 2025
work page 2025
-
[7]
Automated triage of cancer-suspicious skin lesions with 3d total-body photography,
Nicholas R Kurtansky, Maura C Gillis, Noel CF Codella, Brian M D’Alessandro, Zongyuan Ge, Pascale Guitera, Allan C Halpern, Harald Kittler, Josep Malvehy, Konstantinos Liopyris, et al., “Automated triage of cancer-suspicious skin lesions with 3d total-body photography,”npj Digital Medicine, vol. 8, no. 1, pp. 708, 2025
work page 2025
-
[8]
Philipp Tschandl, Giuseppe Argenziano, Majid Razmara, and Jordan Yap, “Diagnostic accuracy of content-based dermatoscopic image re- trieval with deep classification features,”British Journal of Dermatology, vol. 181, no. 1, pp. 155–165, 2019
work page 2019
-
[9]
Dermoscopic image retrieval based on rotation- invariance deep hashing,
Yilan Zhang, Fengying Xie, Xuedong Song, Yushan Zheng, Jie Liu, and Juncheng Wang, “Dermoscopic image retrieval based on rotation- invariance deep hashing,”Medical Image Analysis, vol. 77, pp. 102301, 2022
work page 2022
-
[10]
Erdal ¨Ozbay and Feyza Altunbey ¨Ozbay, “Interpretable pap-smear image retrieval for cervical cancer detection with rotation invariance mask generation deep hashing,”Computers in Biology and Medicine, vol. 154, pp. 106574, 2023
work page 2023
-
[11]
Multi- scale triplet hashing for medical image retrieval,
Yaxiong Chen, Yibo Tang, Jinghao Huang, and Shengwu Xiong, “Multi- scale triplet hashing for medical image retrieval,”Computers in Biology and Medicine, vol. 155, pp. 106633, 2023
work page 2023
-
[12]
Deep hashing with walsh domain for multi-label image retrieval,
Yinqi Chen, Peiwen Li, Yangting Zheng, Weijian Luo, and Xiang Gao, “Deep hashing with walsh domain for multi-label image retrieval,”IEEE Signal Processing Letters, vol. 32, pp. 861–865, 2025
work page 2025
-
[13]
Yuheng Wang, Nandinee Fariah Haq, Jiayue Cai, Sunil Kalia, Harvey Lui, Z Jane Wang, and Tim K Lee, “Multi-channel content based image retrieval method for skin diseases using similarity network fusion and deep community analysis,”Biomedical Signal Processing and Control, vol. 78, pp. 103893, 2022
work page 2022
-
[14]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017
work page 2017
-
[15]
BERT: a review of applications in natural language processing and understanding,
Mikhail V Koroteev, “Bert: a review of applications in natural language processing and understanding,”arXiv preprint arXiv:2103.11943, 2021
-
[16]
Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi, “Med- bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction,”NPJ digital medicine, vol. 4, no. 1, pp. 86, 2021
work page 2021
-
[17]
Pre-trained multimodal large language model enhances derma- tological diagnosis using skingpt-4,
Juexiao Zhou, Xiaonan He, Liyuan Sun, Jiannan Xu, Xiuying Chen, Yuetan Chu, Longxi Zhou, Xingyu Liao, Bin Zhang, Shawn Afvari, et al., “Pre-trained multimodal large language model enhances derma- tological diagnosis using skingpt-4,”Nature Communications, vol. 15, no. 1, pp. 5649, 2024
work page 2024
-
[18]
Swin transformer: Hierarchical vision transformer using shifted windows,
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022
work page 2021
-
[19]
Multi-modal transformer with global-local alignment for com- posed query image retrieval,
Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, and Heng Tao Shen, “Multi-modal transformer with global-local alignment for com- posed query image retrieval,”IEEE Transactions on Multimedia, vol. 25, pp. 8346–8357, 2023
work page 2023
-
[20]
Rongtao Xu, Changwei Wang, Jiguang Zhang, Shibiao Xu, Weiliang Meng, and Xiaopeng Zhang, “Skinformer: Learning statistical texture representation with transformer for skin lesion segmentation,”IEEE Journal of Biomedical and Health Informatics, vol. 28, no. 10, pp. 6008– 6018, 2024
work page 2024
-
[21]
Madhavi Latha Pandala and S Periyanayagi, “Optimal explainable vision transformer framework for skin cancer diagnosis with neural architecture search feature learning,”Biomedical Signal Processing and Control, vol. 112, pp. 108723, 2026
work page 2026
-
[22]
Composed image retrieval via cross relation network with hierarchical aggregation transformer,
Qu Yang, Mang Ye, Zhaohui Cai, Kehua Su, and Bo Du, “Composed image retrieval via cross relation network with hierarchical aggregation transformer,”IEEE Transactions on Image Processing, vol. 32, pp. 4543–4554, 2023
work page 2023
-
[23]
A comprehensive survey on composed image retrieval,
Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, and Liqiang Nie, “A comprehensive survey on composed image retrieval,”ACM Transactions on Information Systems, vol. 44, no. 1, pp. 1–54, 2025
work page 2025
-
[24]
Textbridge: A text-centered framework for enhanced multimodal inte- gration and retrieval,
Jie Guo, Wenwei Wang, Haiyang Jing, Bin Song, and Minghao Wang, “Textbridge: A text-centered framework for enhanced multimodal inte- gration and retrieval,”IEEE Transactions on Multimedia, 2026
work page 2026
-
[25]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,”arXiv preprint arXiv:1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[26]
Yuheng Wang, Tianze Yu, Jiayue Cai, Sunil Kalia, Harvey Lui, Z Jane Wang, and Tim K Lee, “Integrating clinical knowledge graphs and gradient-based neural systems for enhanced melanoma diagnosis via the seven-point checklist,”IEEE Transactions on Neural Networks and Learning Systems, 2025
work page 2025
-
[27]
Unsupervised visual similarity-based medical image retrieval via dual-encoder and metric learning,
Xiya Weng, Yan Zhuang, Rui Wang, Ke Chen, Lin Han, Zhan Hua, and Jiangli Lin, “Unsupervised visual similarity-based medical image retrieval via dual-encoder and metric learning,”Neurocomputing, vol. 634, pp. 129861, 2025
work page 2025
-
[28]
Saeed Iqbal, Adnan N Qureshi, Musaed Alhussein, Imran Arshad Choudhry, Khursheed Aurangzeb, and Tariq M Khan, “Fusion of textural and visual information for medical image modality retrieval using deep learning-based feature engineering,”IEEE Access, vol. 11, pp. 93238– 93253, 2023
work page 2023
-
[29]
Global meets local: Dual activation hashing network for large-scale fine-grained image retrieval,
Xin Jiang, Hao Tang, and Zechao Li, “Global meets local: Dual activation hashing network for large-scale fine-grained image retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 11, pp. 6266–6279, 2024
work page 2024
-
[30]
Seven-point checklist and skin lesion classification using multitask multimodal neural nets,
Jeremy Kawahara, Sara Daneshvar, Giuseppe Argenziano, and Ghassan Hamarneh, “Seven-point checklist and skin lesion classification using multitask multimodal neural nets,”IEEE journal of biomedical and health informatics, vol. 23, no. 2, pp. 538–546, 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.