A Sketch+Text Composed Image Retrieval Dataset for Thangka
Pith reviewed 2026-05-16 06:06 UTC · model grok-4.3
The pith
Existing composed image retrieval methods struggle to align sketches and hierarchical text with fine-grained Thangka images without domain-specific training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CIRThan supplies 2,287 high-quality Thangka images together with human sketches and hierarchical textual descriptions at three semantic levels, allowing composed queries that combine structural intent with multi-level semantic specification. Benchmark evaluations of representative supervised and zero-shot CIR methods show that existing approaches, largely developed for general-domain imagery, struggle to align sketch-based abstractions and hierarchical textual semantics with these fine-grained, symbolically dense images, particularly without in-domain supervision.
What carries the argument
The CIRThan dataset, which pairs each Thangka image with a sketch for structural abstraction and hierarchical text for multi-level semantic specification to support composed retrieval queries.
If this is right
- Supervised in-domain training substantially outperforms zero-shot transfer on sketch-plus-text queries for structured cultural imagery.
- Hierarchical textual descriptions are necessary to capture the dense symbolic elements that single-level text misses in Thangka retrieval.
- New alignment techniques are required to map abstract sketches onto the intricate visual conventions of knowledge-specific image domains.
- The dataset enables systematic testing of multimodal methods for other cultural-heritage retrieval tasks with similar structural complexity.
Where Pith is reading between the lines
- Methods that succeed on this benchmark may generalize to other domains with layered symbolic content such as technical diagrams or religious iconography.
- Adding explicit cultural knowledge bases to the retrieval pipeline could address semantic conventions that sketches and text alone do not fully convey.
- Varying sketch abstraction levels in follow-up experiments would clarify how much drawing precision is needed for reliable retrieval.
- The same data-collection approach could be applied to create benchmarks for additional non-Western art traditions that share hierarchical visual grammars.
Load-bearing premise
The 2,287 selected Thangka images and their human annotations are representative enough of the full range of Thangka complexity and semantic conventions to serve as a reliable benchmark.
What would settle it
A general-domain CIR model achieving high retrieval accuracy on the CIRThan test split without any Thangka-specific training data or fine-tuning would show that the observed struggles are not inherent to the domain.
Original abstract
Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CIRThan, a sketch+text composed image retrieval dataset for Thangka imagery containing 2,287 high-quality images each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels. It supplies standardized splits, dataset analysis, and benchmark evaluations of supervised and zero-shot CIR methods, claiming that existing general-domain approaches struggle to align sketch abstractions and hierarchical textual semantics with fine-grained Thangka images, especially without in-domain supervision.
Significance. If the 2,287 images prove representative of Thangka structural density, symbolic conventions, and semantic hierarchies, the dataset would constitute a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage domains. The public release and initial results highlighting domain-specific gaps are constructive contributions.
major comments (2)
- [Dataset construction] Dataset construction section: The selection of the 2,287 Thangka images is described as 'high-quality' and 'culturally grounded' but lacks quantitative coverage statistics over motif types, style periods, or compositional complexity. This is load-bearing for the central claim that observed performance gaps reflect intrinsic domain difficulty rather than selection bias.
- [Benchmark evaluations] Benchmark evaluations section: Details on annotation protocols, inter-annotator agreement, quality control procedures, and precise evaluation metrics (e.g., recall@k definitions) are insufficient, limiting full verification of the reported performance claims.
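For reference, the recall@k metric named in the comment above is commonly defined as the fraction of queries whose ground-truth image appears among the top k retrieved results. The following is a minimal Python sketch of that common definition, not the paper's confirmed formulation. It assumes exactly one target image per composed sketch+text query; the function name `recall_at_k` and the image ids are hypothetical.

```python
from typing import Sequence


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target id is among the top-k ranked ids.

    Assumption (not from the paper): one ground-truth image per query.
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        raise ValueError("at least one query is required")
    if k < 1:
        raise ValueError("k must be a positive integer")
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Hypothetical ranked gallery ids for four composed queries.
rankings = [
    ["img_07", "img_02", "img_11"],  # target at rank 1
    ["img_05", "img_01", "img_09"],  # target at rank 2
    ["img_03", "img_08", "img_04"],  # target at rank 3
    ["img_10", "img_06", "img_12"],  # target not retrieved
]
targets = ["img_07", "img_01", "img_04", "img_99"]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_2 = recall_at_k(rankings, targets, 2)
r_at_3 = recall_at_k(rankings, targets, 3)
```

If a benchmark allows several acceptable targets per query, the hit test would need to change, which is one reason the exact formulation matters for verifying reported numbers.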
minor comments (1)
- [Dataset analysis] The manuscript would benefit from explicit clarification of the 'comprehensive dataset analysis' metrics and any tables summarizing motif or style distributions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to improve clarity and rigor.
- Referee: [Dataset construction] Dataset construction section: The selection of the 2,287 Thangka images is described as 'high-quality' and 'culturally grounded' but lacks quantitative coverage statistics over motif types, style periods, or compositional complexity. This is load-bearing for the central claim that observed performance gaps reflect intrinsic domain difficulty rather than selection bias.
Authors: We agree that quantitative coverage statistics would strengthen the argument for representativeness and help distinguish domain difficulty from potential selection effects. In the revised manuscript, we will add a dedicated subsection with tables reporting the distribution across motif types (e.g., percentages for Buddha figures, deities, mandalas, and narrative scenes), style periods (e.g., counts from major historical eras such as Yuan, Ming, and Qing), and compositional complexity metrics (e.g., average number of symbolic elements and structural layers per image). These statistics are available from our expert curation logs and will be presented to support the claim that the observed gaps arise from intrinsic Thangka characteristics. revision: yes
- Referee: [Benchmark evaluations] Benchmark evaluations section: Details on annotation protocols, inter-annotator agreement, quality control procedures, and precise evaluation metrics (e.g., recall@k definitions) are insufficient, limiting full verification of the reported performance claims.
Authors: We acknowledge that expanded details on annotation and evaluation are required for reproducibility. In the revision, we will augment the relevant sections with: (1) explicit annotation protocols for sketch drawing and the three-level hierarchical texts, including guidelines provided to annotators; (2) inter-annotator agreement scores (e.g., Fleiss' kappa computed on a subset of text descriptions); (3) quality control procedures such as expert review rounds and resolution of disagreements; and (4) precise metric definitions, including the exact formulation of recall@k for composed sketch+text queries. These additions will allow full verification of the benchmark results. revision: yes
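The inter-annotator agreement statistic mentioned in the response, Fleiss' kappa, can be computed from a table of per-item category counts. The sketch below follows the standard textbook formula and is only illustrative: the rebuttal does not say how ratings of the text descriptions would be categorized, so the two-category layout and the function name `fleiss_kappa` are assumptions.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical data: 3 raters, 2 categories, 4 items.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2], [2, 1], [1, 2]])
```

In this toy data the first table has unanimous ratings on every item, while the second has a two-to-one split on every item, which yields agreement below chance level.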
Circularity Check
No circularity: dataset introduction with direct empirical benchmarks
The paper introduces the CIRThan dataset (2,287 Thangka images with sketches and three-level annotations) and reports benchmark results on existing CIR methods. It contains no derivation chain, equations, fitted parameters, or predictions that reduce to self-defined inputs. The central claims rest on empirical evaluation rather than on self-referential definitions or load-bearing self-citations. The representativeness assumption is a validity concern but does not create circularity in any derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Thangka images require domain-specific knowledge due to complex structures, dense symbolic elements, and domain-dependent semantic conventions.