A Sketch+Text Composed Image Retrieval Dataset for Thangka

Daomin Ji; Jiachen Li; Jiangling Zhang; Jinyu Xu; Qing Xie; Yanchun Ma; Yi Sun; Yongjian Liu; Zhifeng Bao

arxiv: 2602.08411 · v2 · submitted 2026-02-09 · 💻 cs.IR


Pith reviewed 2026-05-16 06:06 UTC · model grok-4.3

classification 💻 cs.IR

keywords Thangka imagery, composed image retrieval, sketch queries, hierarchical text, cultural heritage, multimodal dataset, fine-grained retrieval, domain-specific benchmark


The pith

Existing composed image retrieval methods struggle to align sketches and hierarchical text with fine-grained Thangka images without domain-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CIRThan, a dataset of 2,287 Thangka images each paired with a human-drawn sketch and textual descriptions at three semantic levels. Thangka paintings contain dense symbolic structures and cultural conventions that demand precise structural and semantic matching. Standard CIR techniques developed on general imagery perform poorly when combining sketch abstractions with multi-level text on this material. The work supplies data splits, analysis, and baseline results to measure these gaps. It positions the dataset as a tool for improving retrieval in cultural heritage and other specialized visual domains.

Core claim

CIRThan supplies 2,287 high-quality Thangka images together with human sketches and hierarchical textual descriptions at three semantic levels, allowing composed queries that combine structural intent with multi-level semantic specification. Benchmark evaluations of representative supervised and zero-shot CIR methods show that existing approaches largely developed for general-domain imagery struggle to align sketch-based abstractions and hierarchical textual semantics with these fine-grained, symbolically dense images, particularly without in-domain supervision.

What carries the argument

The CIRThan dataset, which pairs each Thangka image with a sketch for structural abstraction and hierarchical text for multi-level semantic specification to support composed retrieval queries.
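As a rough illustration of what one such composed query might look like in code, the sketch below models a single sample as a small record. The class name, field names, and file path are hypothetical placeholders, not the layout of the released dataset; the only facts taken from the paper are that each image has one human-drawn sketch and textual descriptions at three semantic levels.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ComposedQuery:
    """One sketch+text query. All field names are hypothetical."""

    sketch_path: str             # human-drawn sketch carrying the structural intent
    texts: Tuple[str, str, str]  # one description per semantic level
    target_image_id: str         # the Thangka image this query should retrieve

    def text_at(self, level: int) -> str:
        """Return the description at semantic level 1, 2, or 3."""
        if level not in (1, 2, 3):
            raise ValueError("level must be 1, 2, or 3")
        return self.texts[level - 1]


# Placeholder values only; real paths and descriptions come from the dataset.
query = ComposedQuery(
    sketch_path="sketches/example.png",
    texts=(
        "placeholder level-1 text",
        "placeholder level-2 text",
        "placeholder level-3 text",
    ),
    target_image_id="example",
)
```

Keeping the three descriptions in one fixed-length tuple makes it easy to run the same retrieval experiment once per semantic level and compare the results.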

If this is right

Supervised in-domain training substantially outperforms zero-shot transfer on sketch-plus-text queries for structured cultural imagery.
Hierarchical textual descriptions are necessary to capture the dense symbolic elements that single-level text misses in Thangka retrieval.
New alignment techniques are required to map abstract sketches onto the intricate visual conventions of knowledge-specific image domains.
The dataset enables systematic testing of multimodal methods for other cultural-heritage retrieval tasks with similar structural complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Methods that succeed on this benchmark may generalize to other domains with layered symbolic content such as technical diagrams or religious iconography.
Adding explicit cultural knowledge bases to the retrieval pipeline could address semantic conventions that sketches and text alone do not fully convey.
Varying sketch abstraction levels in follow-up experiments would clarify how much drawing precision is needed for reliable retrieval.
The same data-collection approach could be applied to create benchmarks for additional non-Western art traditions that share hierarchical visual grammars.

Load-bearing premise

The 2,287 selected Thangka images and their human annotations are representative enough of the full range of Thangka complexity and semantic conventions to serve as a reliable benchmark.

What would settle it

A general-domain CIR model achieving high retrieval accuracy on the CIRThan test split without any Thangka-specific training data or fine-tuning would show that the observed struggles are not inherent to the domain.
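Retrieval accuracy of this kind is usually reported as Recall@K. The following is a minimal sketch of that metric, assuming exactly one ground-truth target image per query and a precomputed query-by-gallery score matrix. The paper's exact formulation is not given in this review, so the function name, the tie handling, and the toy numbers are assumptions.

```python
import numpy as np


def recall_at_k(scores: np.ndarray, target_idx: np.ndarray, k: int) -> float:
    """Fraction of queries whose target is among the k best-scoring images.

    scores:     (num_queries, gallery_size), higher means more similar.
    target_idx: (num_queries,), gallery index of each query's ground truth.
    """
    if not 1 <= k <= scores.shape[1]:
        raise ValueError("k must be between 1 and the gallery size")
    # Score that each query assigns to its own ground-truth image.
    target_scores = scores[np.arange(len(target_idx)), target_idx]
    # Rank = number of gallery images scoring strictly higher than the target
    # (ties are resolved in the target's favour, an optimistic assumption).
    ranks = (scores > target_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())


# Toy example: 3 queries against a 3-image gallery.
toy_scores = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.4, 0.6, 0.7],
])
toy_targets = np.array([0, 2, 0])
r1 = recall_at_k(toy_scores, toy_targets, 1)
r2 = recall_at_k(toy_scores, toy_targets, 2)
```

In the toy example the three targets sit at ranks 0, 1, and 2, so Recall@1 is 1/3 and Recall@2 is 2/3.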

Figures

Figures reproduced from arXiv: 2602.08411 by Daomin Ji, Jiachen Li, Jiangling Zhang, Jinyu Xu, Qing Xie, Yanchun Ma, Yi Sun, Yongjian Liu, Zhifeng Bao.

**Figure 2.** The data construction pipeline of CIRThan.

**Figure 3.** Statistical analysis of image resolution and subject distribution in the dataset.

Original abstract

Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
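To make the idea of a composed sketch+text query concrete, here is a minimal late-fusion sketch: the sketch embedding and the text embedding are normalized, mixed with a weight, and compared against gallery image embeddings by cosine similarity. This is a generic baseline written for illustration, not one of the methods benchmarked in the paper. The function names and the `alpha` weight are assumptions, and the embeddings are assumed to come from some shared-space encoder that is not shown.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit length, guarding against division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def late_fusion_scores(
    sketch_emb: np.ndarray,   # (d,) embedding of the sketch
    text_emb: np.ndarray,     # (d,) embedding of one text description
    gallery_emb: np.ndarray,  # (n, d) embeddings of the gallery images
    alpha: float = 0.5,       # hypothetical weight on the sketch modality
) -> np.ndarray:
    """Cosine similarity between the fused query and every gallery image."""
    fused = alpha * l2_normalize(sketch_emb) + (1.0 - alpha) * l2_normalize(text_emb)
    return l2_normalize(gallery_emb) @ l2_normalize(fused)


# Toy gallery of three orthogonal "images".
gallery = np.eye(3)
sketch = np.array([1.0, 0.0, 0.0])  # points at image 0
text = np.array([0.0, 1.0, 0.0])    # points at image 1
sketch_heavy = late_fusion_scores(sketch, text, gallery, alpha=0.8)
text_heavy = late_fusion_scores(sketch, text, gallery, alpha=0.2)
```

With the sketch weighted at 0.8 the top-ranked image is the one the sketch points at, and with 0.2 it is the one the text points at. A single mixing weight like this cannot express the interaction between structure and multi-level semantics that the abstract describes.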

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CIRThan gives a new sketch-plus-hierarchical-text benchmark for Thangka retrieval and shows general methods fall short, but the image selection lacks clear coverage stats.


This paper's main contribution is the CIRThan dataset: 2,287 Thangka images, each with a human sketch and three-level text descriptions for composed queries. It fills a gap by targeting a domain with dense symbols and specific conventions that general CIR work ignores. The authors supply splits, some dataset stats, and baseline runs on supervised and zero-shot models, and the numbers show those models struggle without in-domain data. That part is useful and straightforward.

The soft spot is the image selection. The paper describes the set as high-quality and culturally grounded but gives no numbers on motif coverage, style periods, or complexity distribution, so the performance gaps could partly reflect the particular sample rather than the whole domain. Annotation protocols and quality checks also get limited space.

This work is for researchers building or testing multimodal retrieval in cultural heritage or other narrow visual domains. A reader who needs a fresh testbed for fine-grained sketch and text alignment will get value from the data and the initial results. It deserves peer review because releasing a targeted benchmark like this can push methods forward, even if the authors need to add more on how the images were chosen and validated. I would send it to referees.

Referee Report

2 major / 1 minor

Summary. The paper introduces CIRThan, a sketch+text composed image retrieval dataset for Thangka imagery containing 2,287 high-quality images each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels. It supplies standardized splits, dataset analysis, and benchmark evaluations of supervised and zero-shot CIR methods, claiming that existing general-domain approaches struggle to align sketch abstractions and hierarchical textual semantics with fine-grained Thangka images, especially without in-domain supervision.

Significance. If the 2,287 images prove representative of Thangka structural density, symbolic conventions, and semantic hierarchies, the dataset would constitute a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage domains. The public release and initial results highlighting domain-specific gaps are constructive contributions.

major comments (2)

[Dataset construction] Dataset construction section: The selection of the 2,287 Thangka images is described as 'high-quality' and 'culturally grounded' but lacks quantitative coverage statistics over motif types, style periods, or compositional complexity. This is load-bearing for the central claim that observed performance gaps reflect intrinsic domain difficulty rather than selection bias.
[Benchmark evaluations] Benchmark evaluations section: Details on annotation protocols, inter-annotator agreement, quality control procedures, and precise evaluation metrics (e.g., recall@k definitions) are insufficient, limiting full verification of the reported performance claims.

minor comments (1)

[Dataset analysis] The manuscript would benefit from explicit clarification of the 'comprehensive dataset analysis' metrics and any tables summarizing motif or style distributions.
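The coverage tables requested in these comments amount to a per-category tally. A minimal sketch, assuming each image carries a single categorical label (for example a motif type or style period); the function name and the placeholder labels are hypothetical and are not statistics from the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def coverage_table(labels: Iterable[str]) -> Dict[str, float]:
    """Percentage of images per category, most frequent category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return {label: 100.0 * n / total for label, n in counts.most_common()}


# Placeholder labels only; real values would come from the curation logs.
example_labels = ["motif_a", "motif_a", "motif_b", "motif_c"]
example_table = coverage_table(example_labels)
```

A table like this, reported per split as well as for the full set, would let a reader check whether the test split under-represents any category.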

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to improve clarity and rigor.


Referee: [Dataset construction] Dataset construction section: The selection of the 2,287 Thangka images is described as 'high-quality' and 'culturally grounded' but lacks quantitative coverage statistics over motif types, style periods, or compositional complexity. This is load-bearing for the central claim that observed performance gaps reflect intrinsic domain difficulty rather than selection bias.

Authors: We agree that quantitative coverage statistics would strengthen the argument for representativeness and help distinguish domain difficulty from potential selection effects. In the revised manuscript, we will add a dedicated subsection with tables reporting the distribution across motif types (e.g., percentages for Buddha figures, deities, mandalas, and narrative scenes), style periods (e.g., counts from major historical eras such as Yuan, Ming, and Qing), and compositional complexity metrics (e.g., average number of symbolic elements and structural layers per image). These statistics are available from our expert curation logs and will be presented to support the claim that the observed gaps arise from intrinsic Thangka characteristics. revision: yes

Referee: [Benchmark evaluations] Benchmark evaluations section: Details on annotation protocols, inter-annotator agreement, quality control procedures, and precise evaluation metrics (e.g., recall@k definitions) are insufficient, limiting full verification of the reported performance claims.

Authors: We acknowledge that expanded details on annotation and evaluation are required for reproducibility. In the revision, we will augment the relevant sections with: (1) explicit annotation protocols for sketch drawing and the three-level hierarchical texts, including guidelines provided to annotators; (2) inter-annotator agreement scores (e.g., Fleiss' kappa computed on a subset of text descriptions); (3) quality control procedures such as expert review rounds and resolution of disagreements; and (4) precise metric definitions, including the exact formulation of recall@k for composed sketch+text queries. These additions will allow full verification of the benchmark results. revision: yes
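For reference, the Fleiss' kappa statistic mentioned in the response can be computed as in the sketch below. It assumes every item is rated by the same number of annotators and that the ratings are categorical. How free-text descriptions would be mapped to categories is not specified in the rebuttal, so the input layout here is an assumption.

```python
import numpy as np


def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for a matrix of category counts.

    ratings[i, j] is the number of annotators who put item i in category j.
    Every row must sum to the same number of annotators (at least 2).
    """
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[0]
    n_raters = ratings[0].sum()
    if n_raters < 2 or not np.allclose(ratings.sum(axis=1), n_raters):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Overall share of all ratings that fall in each category.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement: fraction of annotator pairs that agree.
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()            # observed agreement
    p_e = np.square(p_j).sum()    # agreement expected by chance
    if np.isclose(p_e, 1.0):
        raise ValueError("kappa is undefined when only one category is used")
    return float((p_bar - p_e) / (1.0 - p_e))


# Toy data: 3 annotators, 2 categories.
perfect = np.array([[3, 0], [0, 3], [3, 0], [0, 3]])
mixed = np.array([[2, 1], [1, 2]])
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

The `perfect` matrix yields a kappa of 1, while the `mixed` matrix yields a negative value (-1/3), meaning agreement below what chance would produce.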

Circularity Check

0 steps flagged

No circularity: dataset introduction with direct empirical benchmarks


The paper introduces the CIRThan dataset (2,287 Thangka images with sketches and three-level annotations) and reports benchmark results on existing CIR methods. It contains no derivation chain, equations, fitted parameters, or predictions that could reduce to self-defined inputs. The central claims rest on empirical evaluation rather than on self-referential definitions or load-bearing self-citations. The representativeness assumption is a validity concern but does not create circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that Thangka imagery possesses unique structural and semantic properties not captured by existing general-domain datasets.

axioms (1)

domain assumption: Thangka images require domain-specific knowledge due to complex structures, dense symbolic elements, and domain-dependent semantic conventions.
Invoked in the abstract to justify why general CIR methods are insufficient.

pith-pipeline@v0.9.0 · 5560 in / 1207 out tokens · 53941 ms · 2026-05-16T06:06:27.339368+00:00 · methodology


[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. 2023. Zero-Shot Composed Image Retrieval with Textual Inversion. InProceedings of the IEEE/CVF International Conference on Computer Vision. 15292–15301

work page 2023

[3] [3]

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. Conditioned and composed image retrieval combining and partially fine-tuning clip-based features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4959–4968

work page 2022

[4] [4]

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. Effective conditioned and composed image retrieval combining clip-based fea- tures. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21466–21474

work page 2022

[5] [5]

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2023. Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP- based Features.ACM Transactions on Multimedia Computing, Communications and Applications20, 3 (2023), 1–24

work page 2023

[6] [6]

Tu Bui, Leonardo Ribeiro, Moacir Ponti, and John Collomosse. 2018. Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression.Computers & Graphics71 (2018), 77–87

work page 2018

[7] [7]

John Collomosse, Tu Bui, and Hailin Jin. 2019. LiveSketch: Query Perturbations for Guided Sketch-Based Visual Search. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2874–2882

work page 2019

[8] [8]

Gaohuan Dong, Qing Xie, Jiachen Li, Yanchun Ma, Yuhan Liu, and Yongjian Liu

work page

[9] [9]

In Proceedings of the 5th ACM International Conference on Multimedia in Asia

A multi-scale and dense object detector for tibetan thangka images. In Proceedings of the 5th ACM International Conference on Multimedia in Asia. 1–7

work page

[10] [10]

Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, and Anand Mishra. 2024. Composite Sketch+Text queries for retrieving objects with elu- sive names and complex interactions. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 1869–1877

work page 2024

[11] [11]

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun

work page

[12] [12]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition

Language-only Efficient Training of Zero-shot Composed Image Retrieval. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition. 13225–13234

work page

[13]

Anshu Hu, Yifei Sun, Jiachen Li, Yanchun Ma, Qing Xie, and Yongjian Liu. 2025. TOVect: Topology-Optimized Vectorization for Intangible Cultural Heritage Thangka Element Line Art. In Proceedings of the 7th ACM International Conference on Multimedia in Asia. 1–7

[14]

Yadong Huo, Qibing Qin, Jiangyan Dai, Lei Wang, Wenfeng Zhang, Lei Huang, and Chengduan Wang. 2024. Deep Semantic-Aware Proxy Hashing for Multi-Label Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34, 1 (2024), 576–589

[15]

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. 4904–4916

[16]

Ding Jiang and Mang Ye. 2023. Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2787–2797

[17]

Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, and Xueming Qian. 2024. CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2177–2187

[18]

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. 2024. Vision-by-Language for Training-Free Compositional Image Retrieval. International Conference on Learning Representations (2024)

[20]

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. 2024. You’ll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16509–16519

[21]

Jianjun Lei, Yuxin Song, Bo Peng, Zhanyu Ma, Ling Shao, and Yi-Zhe Song. 2020. Semi-Heterogeneous Three-Way Joint Embedding Network for Sketch-Based Image Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 30, 9 (2020), 3226–3237

[22]

Jiachen Li, Hongyun Wang, Xiaolong Peng, Jinyu Xu, Qing Xie, Yanchun Ma, Wenbo Jiang, and Mengzi Tang. 2026. Guided by Principles of Composition: A Domain-Specific Priors Based Detector for Recognizing Ritual Implements in Thangka. IET Image Processing 20, 1 (2026), e70271

[23]

Zixu Li, Zhiwei Chen, Haokun Wen, Zhiheng Fu, Yupeng Hu, and Weili Guan. Encoder: Entity mining and modification relation binding for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 5101–5109

[25]

Zhixin Ling, Zhen Xing, Jiangtong Li, and Li Niu. 2022. Multi-Level Region Matching for Fine-Grained Sketch-Based Image Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia. 462–470

[26]

Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2105–2114

[28]

Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, and Stephen Gould. Bi-directional training for composed image retrieval via text prompt learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5753–5762

[30]

Christian Lülf, Denis Mayr Lima Martins, Marcos Antonio Vaz Salles, Yongluan Zhou, and Fabian Gieseke. 2024. CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2719–2723

[31]

OpenAI. 2024. GPT-4o mini Model. https://platform.openai.com/docs/models/gpt-4o-mini

[32]

OpenAI. 2025. GPT-4.1 Model. https://platform.openai.com/docs/models/gpt-4.1

[33]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-pe...

[34]

Yonggang Qi, Yi-Zhe Song, Honggang Zhang, and Jun Liu. 2016. Sketch-based image retrieval via Siamese convolutional neural network. In 2016 IEEE International Conference on Image Processing. 2460–2464

[35]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning. 8748–8763

[36]

Hui Ren, Ke Sun, Fanhua Zhao, and Xian Zhu. 2024. Dunhuang murals image restoration method based on generative adversarial network. Heritage Science 12, 1 (2024), 39

[37]

Aneeshan Sain, Ayan Kumar Bhunia, Subhadeep Koley, Pinaki Nath Chowdhury, Soumitri Chattopadhyay, Tao Xiang, and Yi-Zhe Song. 2023. Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6873–6883

[38]

Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. 2023. Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19305–19314

[39]

Omar Seddati, Stéphane Dupont, and Saïd Mahmoudi. 2017. Triplet Networks Feature Masking for Sketch-Based Image Retrieval. In International Conference on Image Analysis and Recognition. 296–303

[40]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565

[41]

Haifeng Sun, Jiaqing Xu, Jingyu Wang, Qi Qi, Ce Ge, and Jianxin Liao. 2022. DLI-Net: Dual Local Interaction Network for Fine-Grained Sketch-Based Image Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 32, 10 (2022), 7177–7189

[42]

Shitong Sun, Fanghua Ye, and Shaogang Gong. 2024. Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking. (2024). arXiv:2312.08924 [cs.CV] https://arxiv.org/abs/2312.08924

[43]

Yi Sun, Jinyu Xu, Qing Xie, Jiachen Li, Yanchun Ma, and Yongjian Liu. 2026. SDR-CIR: Semantic Debias Retrieval Framework for Training-Free Zero-Shot Composed Image Retrieval. In Proceedings of the ACM Web Conference 2026. 2149–2159

[44]

Zelong Sun, Dong Jing, and Zhiwu Lu. 2025. CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22675–22684

[45]

Yucheng Suo, Fan Ma, Linchao Zhu, and Yi Yang. 2024. Knowledge-Enhanced Dual-Stream Zero-Shot Composed Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 26941–26952

[46]

Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, and Qi Wu. Context-I2W: mapping images to context-dependent words for accurate zero-shot composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5180–5188

[48]

Yuanmin Tang, Jue Zhang, Xiaoting Qin, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Wu. 2025. Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14400–14410

[49]

Sagar Vaze, Nicolas Carion, and Ishan Misra. 2023. GeneCIS: A Benchmark for General Conditional Image Similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6862–6872

[50]

Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing Text and Image for Image Retrieval - an Empirical Odyssey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6432–6441

[52]

Lan Wang, Wei Ao, Vishnu Naresh Boddeti, and Ser-Nam Lim. 2025. Generative Zero-Shot Composed Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 29690–29700

[53]

Nianyi Wang, Weilan Wang, Wenjin Hu, Aaron Fenster, and Shuo Li. 2021. Thanka Mural Inpainting Based on Multi-Scale Adaptive Partial Convolution and Stroke-Like Mask. IEEE Transactions on Image Processing 30 (2021), 3720–3733

[54]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191 (2024)

[55]

Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. 2021. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11302–11312

[56]

Jinyu Xu, Qing Xie, Jiachen Li, Zhifeng Bao, Yanchun Ma, and Yongjian Liu. 2026. Enhancing Fine-Grained Sketch-based Image Retrieval through Contextual Information. IEEE Transactions on Multimedia (2026), 1–12

[58]

Zhenyu Yang, Shengsheng Qian, Dizhan Xue, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. 2024. Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia. 1245–1254

[59]

Zhenyu Yang, Dizhan Xue, Shengsheng Qian, Weiming Dong, and Changsheng Xu. 2024. LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 80–90

[60]

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. 2022. FILIP: Fine-grained Interactive Language-Image Pre-Training. In International Conference on Learning Representations

[61]

Ying Zheng, Hongxun Yao, and Xiaoshuai Sun. 2021. Deep Semantic Parsing of Freehand Sketches With Homogeneous Transformation, Soft-Weighted Loss, and Staged Learning. IEEE Transactions on Multimedia 23 (2021), 3590–3602