arxiv: 2602.08411 · v2 · submitted 2026-02-09 · 💻 cs.IR

A Sketch+Text Composed Image Retrieval Dataset for Thangka

Pith reviewed 2026-05-16 06:06 UTC · model grok-4.3

classification 💻 cs.IR
keywords Thangka imagery · composed image retrieval · sketch queries · hierarchical text · cultural heritage · multimodal dataset · fine-grained retrieval · domain-specific benchmark

The pith

Existing composed image retrieval methods struggle to align sketches and hierarchical text with fine-grained Thangka images without domain-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CIRThan, a dataset of 2,287 Thangka images each paired with a human-drawn sketch and textual descriptions at three semantic levels. Thangka paintings contain dense symbolic structures and cultural conventions that demand precise structural and semantic matching. Standard CIR techniques developed on general imagery perform poorly when combining sketch abstractions with multi-level text on this material. The work supplies data splits, analysis, and baseline results to measure these gaps. It positions the dataset as a tool for improving retrieval in cultural heritage and other specialized visual domains.
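
As a rough illustration of the pairing described above, one dataset entry could be modeled as a small record holding the target image, its sketch, and the three text levels. The field names, file paths, placeholder strings, and the ordering of the levels are all assumptions made for this sketch; the released repository may organize its data differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThangkaSample:
    """One hypothetical entry: target image, human sketch, three text levels."""
    image_path: str               # assumed field name, not taken from the release
    sketch_path: str              # assumed field name, not taken from the release
    texts: Tuple[str, str, str]   # one description per semantic level (order assumed)

    def composed_query(self, level: int) -> Tuple[str, str]:
        """Return the (sketch, text) pair that forms a query at the given level (1..3)."""
        if not 1 <= level <= 3:
            raise ValueError("level must be 1, 2, or 3")
        return self.sketch_path, self.texts[level - 1]


# Placeholder values only; no real dataset content is reproduced here.
sample = ThangkaSample(
    image_path="images/0001.jpg",
    sketch_path="sketches/0001.png",
    texts=(
        "placeholder level-1 text",
        "placeholder level-2 text",
        "placeholder level-3 text",
    ),
)
sketch, text = sample.composed_query(level=2)
```

Keeping the three levels in one record makes it straightforward to evaluate the same sketch against each level of textual detail.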

Core claim

CIRThan supplies 2,287 high-quality Thangka images together with human sketches and hierarchical textual descriptions at three semantic levels, allowing composed queries that combine structural intent with multi-level semantic specification. Benchmark evaluations of representative supervised and zero-shot CIR methods show that existing approaches, largely developed for general-domain imagery, struggle to align sketch-based abstractions and hierarchical textual semantics with these fine-grained, symbolically dense images, particularly without in-domain supervision.

What carries the argument

The CIRThan dataset, which pairs each Thangka image with a sketch for structural abstraction and hierarchical text for multi-level semantic specification to support composed retrieval queries.
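
To make the idea of a composed sketch+text query concrete, the sketch below fuses a sketch embedding and a text embedding by a weighted average and ranks gallery images by cosine similarity. This is a generic late-fusion baseline chosen for illustration, not one of the methods benchmarked in the paper; the function names, the `alpha` weight, and the toy three-dimensional embeddings are all assumptions.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_query(sketch_emb: Sequence[float], text_emb: Sequence[float],
               alpha: float = 0.5) -> List[float]:
    """Weighted average of the normalized sketch and text embeddings (alpha is assumed)."""
    if len(sketch_emb) != len(text_emb):
        raise ValueError("embeddings must share one dimensionality")
    s = l2_normalize(sketch_emb)
    t = l2_normalize(text_emb)
    return l2_normalize([alpha * a + (1.0 - alpha) * b for a, b in zip(s, t)])


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [sum(q * g for q, g in zip(query, l2_normalize(img))) for img in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy embeddings: the sketch points along axis 0, the text along axis 1.
sketch_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]
gallery = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_gallery(fuse_query(sketch_emb, text_emb), gallery)
```

In this toy case the gallery item that matches both modalities (index 1) ranks above the item that matches the sketch alone (index 0), which is the behavior a composed query is meant to produce.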

If this is right

  • Supervised in-domain training substantially outperforms zero-shot transfer on sketch-plus-text queries for structured cultural imagery.
  • Hierarchical textual descriptions are necessary to capture the dense symbolic elements that single-level text misses in Thangka retrieval.
  • New alignment techniques are required to map abstract sketches onto the intricate visual conventions of knowledge-specific image domains.
  • The dataset enables systematic testing of multimodal methods for other cultural-heritage retrieval tasks with similar structural complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that succeed on this benchmark may generalize to other domains with layered symbolic content such as technical diagrams or religious iconography.
  • Adding explicit cultural knowledge bases to the retrieval pipeline could address semantic conventions that sketches and text alone do not fully convey.
  • Varying sketch abstraction levels in follow-up experiments would clarify how much drawing precision is needed for reliable retrieval.
  • The same data-collection approach could be applied to create benchmarks for additional non-Western art traditions that share hierarchical visual grammars.

Load-bearing premise

The 2,287 selected Thangka images and their human annotations are representative enough of the full range of Thangka complexity and semantic conventions to serve as a reliable benchmark.

What would settle it

A general-domain CIR model achieving high retrieval accuracy on the CIRThan test split without any Thangka-specific training data or fine-tuning would show that the observed struggles are not inherent to the domain.

Figures

Figures reproduced from arXiv: 2602.08411 by Daomin Ji, Jiachen Li, Jiangling Zhang, Jinyu Xu, Qing Xie, Yanchun Ma, Yi Sun, Yongjian Liu, Zhifeng Bao.

Figure 1: Comparison of composed query examples from …
Figure 2: The data construction pipeline of the CIRThan …
Figure 3: Statistical analysis of image resolution and subject distribution in the dataset.

Original abstract

Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CIRThan, a sketch+text composed image retrieval dataset for Thangka imagery containing 2,287 high-quality images each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels. It supplies standardized splits, dataset analysis, and benchmark evaluations of supervised and zero-shot CIR methods, claiming that existing general-domain approaches struggle to align sketch abstractions and hierarchical textual semantics with fine-grained Thangka images, especially without in-domain supervision.

Significance. If the 2,287 images prove representative of Thangka structural density, symbolic conventions, and semantic hierarchies, the dataset would constitute a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage domains. The public release and initial results highlighting domain-specific gaps are constructive contributions.

major comments (2)
  1. [Dataset construction] Dataset construction section: The selection of the 2,287 Thangka images is described as 'high-quality' and 'culturally grounded' but lacks quantitative coverage statistics over motif types, style periods, or compositional complexity. This is load-bearing for the central claim that observed performance gaps reflect intrinsic domain difficulty rather than selection bias.
  2. [Benchmark evaluations] Benchmark evaluations section: Details on annotation protocols, inter-annotator agreement, quality control procedures, and precise evaluation metrics (e.g., recall@k definitions) are insufficient, limiting full verification of the reported performance claims.
minor comments (1)
  1. [Dataset analysis] The manuscript would benefit from explicit clarification of the 'comprehensive dataset analysis' metrics and any tables summarizing motif or style distributions.
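
For reference, the recall@k metric raised in the second major comment is commonly computed as the fraction of queries whose target appears among the top k retrieved items. The sketch below assumes exactly one ground-truth target image per composed query, which is a convention in many CIR benchmarks but is not confirmed here as the paper's formulation; the function name and toy rankings are illustrative.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Fraction of queries whose single target id appears in the top-k ranked ids."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(rankings) != len(targets):
        raise ValueError("need exactly one target per query")
    if not rankings:
        raise ValueError("no queries given")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(rankings)


# Three toy queries over a gallery of three images (ids 0, 1, 2).
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
targets = [0, 1, 2]
r1 = recall_at_k(rankings, targets, k=1)  # only the second query hits at rank 1
r2 = recall_at_k(rankings, targets, k=2)  # the first and second queries hit within rank 2
r3 = recall_at_k(rankings, targets, k=3)  # every target is found within the full gallery
```

If a benchmark allows several valid targets per query, the hit test would need to change, which is one reason an explicit definition matters for verification.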

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to improve clarity and rigor.

  1. Referee: [Dataset construction] Dataset construction section: The selection of the 2,287 Thangka images is described as 'high-quality' and 'culturally grounded' but lacks quantitative coverage statistics over motif types, style periods, or compositional complexity. This is load-bearing for the central claim that observed performance gaps reflect intrinsic domain difficulty rather than selection bias.

    Authors: We agree that quantitative coverage statistics would strengthen the argument for representativeness and help distinguish domain difficulty from potential selection effects. In the revised manuscript, we will add a dedicated subsection with tables reporting the distribution across motif types (e.g., percentages for Buddha figures, deities, mandalas, and narrative scenes), style periods (e.g., counts from major historical eras such as Yuan, Ming, and Qing), and compositional complexity metrics (e.g., average number of symbolic elements and structural layers per image). These statistics are available from our expert curation logs and will be presented to support the claim that the observed gaps arise from intrinsic Thangka characteristics. revision: yes

  2. Referee: [Benchmark evaluations] Benchmark evaluations section: Details on annotation protocols, inter-annotator agreement, quality control procedures, and precise evaluation metrics (e.g., recall@k definitions) are insufficient, limiting full verification of the reported performance claims.

    Authors: We acknowledge that expanded details on annotation and evaluation are required for reproducibility. In the revision, we will augment the relevant sections with: (1) explicit annotation protocols for sketch drawing and the three-level hierarchical texts, including guidelines provided to annotators; (2) inter-annotator agreement scores (e.g., Fleiss' kappa computed on a subset of text descriptions); (3) quality control procedures such as expert review rounds and resolution of disagreements; and (4) precise metric definitions, including the exact formulation of recall@k for composed sketch+text queries. These additions will allow full verification of the benchmark results. revision: yes
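
The Fleiss' kappa statistic mentioned in the second response can be computed from a table of per-item category counts. The sketch below is a plain implementation of the standard formula and assumes that annotators assign each item to one of a fixed set of categories (for example, accept or reject a description) and that every item receives the same number of ratings; how the authors would actually encode agreement on free-text descriptions is not stated.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters placing item i in category j."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("no items given")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Three raters, two categories: perfect agreement versus split votes.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
kappa_split = fleiss_kappa([[2, 1], [1, 2]])
```

Values near 1 indicate agreement well above chance, while values at or below 0 indicate agreement no better than chance.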

Circularity Check

0 steps flagged

No circularity: dataset introduction with direct empirical benchmarks

The paper introduces the CIRThan dataset (2,287 Thangka images with sketches and three-level annotations) and reports benchmark results on existing CIR methods. It contains no derivation chain, equations, fitted parameters, or predictions that reduce to self-defined inputs. Its central claims rest on empirical evaluation rather than on self-referential definitions or load-bearing self-citations. The representativeness assumption is a validity concern but does not create circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that Thangka imagery possesses unique structural and semantic properties not captured by existing general-domain datasets.

axioms (1)
  • domain assumption: Thangka images require domain-specific knowledge due to complex structures, dense symbolic elements, and domain-dependent semantic conventions.
    Invoked in the abstract to justify why general CIR methods are insufficient.

pith-pipeline@v0.9.0 · 5560 in / 1207 out tokens · 53941 ms · 2026-05-16T06:06:27.339368+00:00 · methodology
