
arxiv: 2604.19135 · v1 · submitted 2026-04-21 · 💻 cs.CV

Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

Pith reviewed 2026-05-10 03:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot sketch-based retrieval · 3D shape retrieval · diffusion models · multimodal feature enhancement · CLIP visual features · BLIP text guidance · Circle-T loss · frozen backbone

The pith

A frozen Stable Diffusion model enhanced with CLIP and BLIP features retrieves 3D shapes from sketches without any category supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores using pretrained text-to-image diffusion models for retrieving 3D shapes from sketches in a zero-shot setting where no category labels are available. Existing methods struggle here because sketches are extremely abstract and sparse, and because the lack of supervision prevents learning alignments between 2D inputs and 3D objects. The authors report that a frozen Stable Diffusion backbone can supply useful shape-biased features when its intermediate U-Net layers are conditioned on global and local visual cues from CLIP plus textual guidance from BLIP-generated descriptions combined with soft prompts. A Circle-T loss further helps by pulling positive sketch-3D pairs closer once negatives are separated. If this works, retrieval systems could operate on entirely new object categories without collecting labeled training data for each one.

Core claim

Large-scale pretrained diffusion models exhibit open-vocabulary capability and strong shape bias that suit zero-shot visual retrieval. A frozen Stable Diffusion backbone extracts and aggregates discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. To bridge the domain gap without retraining, a multimodal feature-enhanced strategy injects global and local visual features from a pretrained CLIP encoder and incorporates enriched textual guidance from learnable soft prompts plus hard textual descriptions generated by BLIP. The Circle-T loss dynamically strengthens positive-pair attraction once negative samples are separated. Experiments on two public benchmarks are reported to show consistent gains over state-of-the-art approaches in ZS-SBSR.

What carries the argument

Multimodal feature-enhanced conditioning of a frozen Stable Diffusion U-Net that aggregates intermediate layer representations for sketches and 3D views while injecting CLIP visual features and BLIP text cues.
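As a rough illustration of what aggregating intermediate layer representations can mean, the sketch below average-pools feature maps of different resolutions and concatenates them into one L2-normalized descriptor. The paper's actual aggregation operator is not given on this page, so the pooling choice, the function names `global_avg_pool` and `aggregate_layers`, and the toy layer shapes are all assumptions.

```python
import math

def global_avg_pool(feature_map):
    """Average a C x H x W feature map (nested lists) over H and W."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def aggregate_layers(feature_maps):
    """Pool each layer's map, concatenate the results, and L2-normalize."""
    descriptor = []
    for fmap in feature_maps:
        descriptor.extend(global_avg_pool(fmap))
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor

# Toy stand-ins for two U-Net layers (assumed shapes):
# 2 channels at 2x2 resolution and 1 channel at 1x1 resolution.
layer_a = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
layer_b = [[[2.0]]]
descriptor = aggregate_layers([layer_a, layer_b])
```

The same function would be applied to a sketch and to each rendered 3D view, so that both end up in one comparable embedding space.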

If this is right

  • 3D shape retrieval becomes feasible for object categories never seen during any training phase.
  • The system focuses on sketch contours and semantic context despite high abstraction by using injected multimodal cues.
  • Dynamic adjustment of positive-pair attraction adapts alignment to the noise present in hand-drawn sketches.
  • Consistent gains appear across multiple standard benchmarks without task-specific fine-tuning of the backbone.
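The Circle-T formulation itself is not spelled out on this page. For orientation, here is a minimal sketch of the original Circle loss of Sun et al. [61], which the name suggests it builds on; that relationship is an assumption. The `margin` and `gamma` values are illustrative defaults, not the paper's settings.

```python
import math

def circle_loss(pos_sims, neg_sims, margin=0.25, gamma=32.0):
    """Circle loss (Sun et al., 2020) for one anchor, given cosine similarities."""
    o_p, o_n = 1.0 + margin, -margin          # optima for positives / negatives
    delta_p, delta_n = 1.0 - margin, margin   # decision margins
    # Self-paced weights: a pair far from its optimum gets a larger weight.
    pos_terms = [-gamma * max(o_p - s, 0.0) * (s - delta_p) for s in pos_sims]
    neg_terms = [gamma * max(s - o_n, 0.0) * (s - delta_n) for s in neg_sims]
    total = sum(math.exp(t) for t in neg_terms) * sum(math.exp(t) for t in pos_terms)
    return math.log1p(total)

# A well-separated pair should cost far less than a confused one.
well_separated = circle_loss(pos_sims=[0.9], neg_sims=[0.1])
poorly_separated = circle_loss(pos_sims=[0.4], neg_sims=[0.6])
```

This version exponentiates directly and can overflow for large `gamma`; a practical implementation would use a log-sum-exp formulation instead.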

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-backbone strategy with multimodal injections could extend to other zero-shot cross-modal tasks such as text queries to 3D shapes.
  • If the enhancements prove stable, similar conditioning might reduce the need for full retraining when applying diffusion models to sparse or abstract inputs in related vision problems.
  • Direct processing of 3D data instead of rendered views could be tested as a next step if suitable encoders are paired with the same loss and conditioning approach.

Load-bearing premise

That adding CLIP visual features and BLIP text to a frozen diffusion backbone is enough to overcome the extreme domain gap and sparsity of sketches without any retraining of the model.

What would settle it

Running the method on the same two public benchmarks and finding that it fails to outperform existing zero-shot sketch-based 3D retrieval approaches on retrieval metrics for unseen categories.
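The page does not name the retrieval metrics. Mean average precision (mAP) is a common choice for sketch-based 3D retrieval benchmarks, so the sketch below assumes it. The category labels and rankings are hypothetical toy values, not results from the paper.

```python
def average_precision(ranked_labels, query_label):
    """AP for one query: mean of precision@k over ranks k holding a relevant shape."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """rankings: list of (query_label, ranked_gallery_labels) pairs."""
    aps = [average_precision(ranked, query) for query, ranked in rankings]
    return sum(aps) / len(aps)

# Hypothetical toy rankings for two sketch queries from unseen categories.
toy = [
    ("chair", ["chair", "table", "chair"]),
    ("plane", ["car", "plane", "car"]),
]
score = mean_average_precision(toy)
```

Comparing this score between the method and prior zero-shot approaches, on the unseen-category split only, is the check the paragraph above describes.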

Figures

Figures reproduced from arXiv: 2604.19135 by Fanhe Dong, Hang Cheng, Long Zeng.

Figure 1: Existing method and our method. … zero-shot generalization to unseen categories [7, 28]. In particular, text-to-image diffusion models have been shown to effectively bridge the modality gap between sketches and photos, benefiting from strong cross-modal alignment and an inherent shape bias, as also observed in recent work [32, 33]. Although Stable Diffusion (SD) has demonstrated impressive feature extractio…
Figure 2: Our method extracts multi-scale features from a frozen Stable Diffusion model for sketches and rendered 3D views, …
Figure 3: Effect of timesteps choice of diffusion model on the …
Figure 4: PCA representation of SD’s intermediate UNet fea…
Figure 6: Confusion matrix of retrieval results under circle-T …
Figure 7: Visualization of retrieval examples on SHREC2013 …
Original abstract

This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents Diff-SBSR, the first application of text-to-image diffusion models to zero-shot sketch-based 3D shape retrieval (ZS-SBSR). It freezes a Stable Diffusion U-Net backbone, conditions it multimodally by injecting global/local CLIP visual features and BLIP-generated text augmented with learnable soft prompts, extracts and aggregates intermediate U-Net layer representations for sketches and rendered 3D views, and optimizes with a Circle-T loss that strengthens positive-pair attraction after negatives are separated. Extensive experiments on two public benchmarks are reported to show consistent outperformance over prior SOTA methods.

Significance. If the performance gains are reproducible and the diffusion backbone demonstrably contributes shape bias beyond the CLIP/BLIP conditioning, the work would establish a practical route for repurposing large frozen generative models in sparse-input zero-shot retrieval without retraining. The emphasis on no-backbone fine-tuning and adaptation to sketch noise via the loss are pragmatic strengths that could influence follow-on work in cross-modal retrieval.

major comments (3)
  1. [§3] §3 (multimodal conditioning and U-Net feature extraction): The central claim that intermediate U-Net activations supply discriminative contour/shape information beyond the injected CLIP visual features and BLIP text is load-bearing for the key insight, yet no ablation isolates the U-Net contribution (e.g., conditioned U-Net features vs. direct CLIP embeddings on sketches). Without this, it remains possible that gains derive primarily from the external conditioning signals rather than the diffusion prior.
  2. [§4] §4 (experiments and tables): The reported outperformance on the two benchmarks is presented without statistical significance tests, standard deviations across runs, or explicit confirmation that baselines were re-implemented with identical protocols and hyperparameter tuning. This weakens the strength of the SOTA claim, especially given the domain gap and sparsity issues highlighted in the abstract.
  3. [§3.3] §3.3 (Circle-T loss): The loss is motivated as adapting to sketch noise, but the manuscript provides no sensitivity analysis on its hyperparameters or comparison against standard contrastive losses under the same multimodal conditioning, leaving unclear whether the dynamic positive-pair strengthening is essential to the reported gains.
minor comments (3)
  1. [Figure 2] The architecture diagram (Figure 2) would benefit from explicit arrows and labels indicating where CLIP local/global features and BLIP text are injected into the U-Net.
  2. [§3.2] Notation for feature aggregation across U-Net layers (e.g., the pooling or concatenation operation) is introduced without a compact equation or pseudocode, complicating reproducibility.
  3. [§2] Related-work discussion of prior diffusion-based retrieval or sketch-3D methods could be expanded with more recent citations to contextualize the novelty claim.
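The significance testing requested in major comment 2 could take the form of a paired t-test over per-seed scores. The sketch below computes the t statistic using only the standard library. The five mAP values per method are placeholders invented for illustration and are not taken from the paper.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired per-seed scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Placeholder mAP values for five seeds; illustrative only, not from the paper.
ours = [0.62, 0.64, 0.63, 0.65, 0.61]
baseline = [0.60, 0.61, 0.62, 0.62, 0.60]
t_stat, dof = paired_t_statistic(ours, baseline)
```

With 4 degrees of freedom, the two-sided critical value at the 0.05 level is about 2.776, so a statistic above that threshold would count as significant under this test.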

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the detailed and constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Point-by-point responses
  1. Referee: [§3] §3 (multimodal conditioning and U-Net feature extraction): The central claim that intermediate U-Net activations supply discriminative contour/shape information beyond the injected CLIP visual features and BLIP text is load-bearing for the key insight, yet no ablation isolates the U-Net contribution (e.g., conditioned U-Net features vs. direct CLIP embeddings on sketches). Without this, it remains possible that gains derive primarily from the external conditioning signals rather than the diffusion prior.

    Authors: We agree that an explicit ablation isolating the contribution of the U-Net features is important to substantiate the role of the diffusion prior. In the revised manuscript, we will include an additional ablation study comparing the performance using only the injected CLIP and BLIP features against the full model that extracts and aggregates intermediate U-Net layer representations. This will clarify the incremental benefit provided by the frozen diffusion backbone. revision: yes

  2. Referee: [§4] §4 (experiments and tables): The reported outperformance on the two benchmarks is presented without statistical significance tests, standard deviations across runs, or explicit confirmation that baselines were re-implemented with identical protocols and hyperparameter tuning. This weakens the strength of the SOTA claim, especially given the domain gap and sparsity issues highlighted in the abstract.

    Authors: We acknowledge the importance of statistical rigor in reporting results. We will re-implement all baseline methods using the same experimental protocols and hyperparameter settings as described in our paper. Additionally, we will conduct multiple runs with different random seeds to report mean performance with standard deviations and perform statistical significance tests (e.g., paired t-tests) to validate the improvements. These details will be added to the experimental section and tables in the revised version. revision: yes

  3. Referee: [§3.3] §3.3 (Circle-T loss): The loss is motivated as adapting to sketch noise, but the manuscript provides no sensitivity analysis on its hyperparameters or comparison against standard contrastive losses under the same multimodal conditioning, leaving unclear whether the dynamic positive-pair strengthening is essential to the reported gains.

    Authors: We appreciate this point regarding the necessity of the Circle-T loss. In the revision, we will provide a sensitivity analysis on the key hyperparameters of the Circle-T loss, such as the margin and temperature parameters. Furthermore, we will include a direct comparison against standard contrastive losses (e.g., InfoNCE) under identical multimodal conditioning and backbone settings to demonstrate the advantages of the dynamic positive-pair strengthening mechanism. revision: yes
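For reference, the InfoNCE baseline mentioned in the third response can be sketched as a cross-entropy over temperature-scaled similarities. The `temperature` of 0.07 and the toy similarity rows are assumptions made for illustration, not settings reported by the authors.

```python
import math

def info_nce(sim_row, positive_index, temperature=0.07):
    """InfoNCE for one query: cross-entropy over temperature-scaled similarities."""
    logits = [s / temperature for s in sim_row]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[positive_index]

# One sketch compared against three rendered shapes; index 0 is the true match.
loss_good = info_nce([0.9, 0.1, 0.0], positive_index=0)
loss_bad = info_nce([0.1, 0.9, 0.0], positive_index=0)
```

Unlike Circle loss, InfoNCE weights every pair through a single softmax, with no per-pair re-weighting, which is the contrast the proposed comparison would probe.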

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretrained models and novel adaptations.

full rationale

The paper's central derivation introduces a multimodal conditioning strategy (CLIP visual features + BLIP text) and Circle-T loss applied to a frozen Stable Diffusion U-Net, without any equations or steps that reduce claimed representations or performance to quantities fitted directly on the ZS-SBSR benchmarks. No self-citations are load-bearing for uniqueness theorems, no ansatzes are smuggled via prior author work, and no predictions are statistically forced by input fitting. The approach is self-contained against external benchmarks via new loss and conditioning, with experiments providing independent validation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on the assumption that pretrained diffusion models already possess useful shape bias and open-vocabulary properties for sketches once conditioned; it introduces learnable soft prompts and Circle-T loss hyperparameters whose values are fitted during training but not specified in the abstract.

free parameters (2)
  • learnable soft prompts
    Additional prompt tokens whose weights are learned to enrich textual guidance from BLIP descriptions.
  • Circle-T loss hyperparameters
    Parameters controlling the dynamic attraction of positive pairs once negatives are separated.
axioms (2)
  • domain assumption Large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias suitable for zero-shot visual retrieval.
    Stated as the key insight in the abstract and used to justify freezing the Stable Diffusion backbone.
  • domain assumption CLIP visual features and BLIP-generated text provide complementary cues that close the domain gap for sparse sketches without retraining the diffusion model.
    Central premise for the multimodal feature-enhanced strategy.



Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages

  1. [1]

    Shaojin Bai and Jing Bai. 2023. HDA2L: Hierarchical domain-augmented adaptive learning for sketch-based 3D shape retrieval. Knowledge-Based Systems 264 (2023), 110302

  2. [2]

    Shaojin Bai, Jing Bai, Hao Xu, Jiwen Tuo, and Min Liu. 2023. PAGML: Precise alignment guided metric learning for sketch-based 3D shape retrieval. Image and Vision Computing 136 (2023), 104756

  3. [3]

    Shaojin Bai, Yalu Li, Rihao Chang, Qi Liang, and Weizhi Nie. 2025. SCDL: Sketch Causal Disentangled Learning for Sketch-based 3D Shape Retrieval. IEEE Transactions on Circuits and Systems for Video Technology (2025)

  4. [4]

    Hmrishav Bandyopadhyay, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Tao Xiang, Timothy M. Hospedales, and Yi-Zhe Song. 2024. SketchINR: A first look into sketches as implicit neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12565–12574

  5. [5]

    Hmrishav Bandyopadhyay, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Tao Xiang, and Yi-Zhe Song. 2024. What Sketch Explainability Really Means for Downstream Tasks? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10997–11008

  6. [6]

    Hmrishav Bandyopadhyay, Subhadeep Koley, Ayan Kumar Bhunia, et al. 2024. Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes. In CVPR

  7. [7]

    Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, et al. 2021. Label-Efficient Semantic Segmentation with Diffusion Models. In ICLR

  8. [8]

    Yiyang Cai, Jiaming Lu, Jiewen Wang, and Shuang Liang. 2023. Uncertainty-aware cross-modal transfer network for sketch-based 3D shape retrieval. In IEEE International Conference on Multimedia and Expo. 132–137

  9. [9]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 9650–9660

  10. [10]

    Haoxin Chen, Yong Zhang, Xiaodong Cun, et al. 2024. VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models. In CVPR

  11. [11]

    Jie Chen and Yi Fang. 2018. Deep Cross-Modality Adaptation via Semantics Preserving Adversarial Learning for Sketch-Based 3D Shape Retrieval. In Proceedings of the European Conference on Computer Vision. 605–620

  12. [12]

    Liang Chen et al. 2023. Masked Reconstruction in Diffusion Models. In NeurIPS

  13. [13]

    Gene Chou, Yuval Bahat, and Felix Heide. 2023. Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions. In CVPR

  14. [14]

    Pinaki Nath Chowdhury et al. 2023. Democratising 2D Sketch to 3D Shape Retrieval through Pivoting. In Proceedings of the IEEE/CVF International Conference on Computer Vision

  15. [15]

    Pinaki Nath Chowdhury, Ayan Kumar Bhunia, et al. 2023. What Can Human Sketches Do for Object Detection? In CVPR

  16. [16]

    Guoxian Dai, Jin Xie, and Yi Fang. 2018. Deep correlated holistic metric learning for sketch-based 3D shape retrieval. IEEE Transactions on Image Processing 27, 7 (2018), 3374–3386

  17. [17]

    Guoxian Dai, Jin Xie, Fan Zhu, and Yi Fang. 2017. Deep correlated metric learning for sketch-based 3D shape retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31

  18. [18]

    Weidong Dai and Shuang Liang. 2020. Cross-modal guidance network for sketch-based 3D shape retrieval. In IEEE International Conference on Multimedia and Expo. 1–6

  19. [19]

    Tal Darom and Yosi Keller. 2012. Scale-invariant features for 3-D mesh models. IEEE Transactions on Image Processing 21, 5 (2012), 2758–2769

  20. [20]

    Bram de Wilde, Anindo Saha, et al. 2024. Medical Diffusion on a Budget: Textual Inversion for Medical Image Generation. In MIDL

  21. [21]

    Cheng Deng, Xinxun Xu, Hao Wang, Muli Yang, and Dacheng Tao. 2020. Progressive cross-modal semantic network for zero-shot sketch-based image retrieval. IEEE Transactions on Image Processing 29 (2020), 8892–8902

  22. [22]

    Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS

  23. [23]

    Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. 2018. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV). 52–68

  24. [24]

    Cusuh Ham, Gemma Canet Tarres, et al. 2022. CoGS: Controllable Generation and Search from Sketch and Style. In ECCV

  25. [25]

    Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. 2018. Triplet-center loss for multi-view 3d object retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1945–1954

  26. [26]

    Amir Hertz, Ron Mokady, Jay Tenenbaum, et al. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. In ICLR

  27. [27]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In NeurIPS

  28. [28]

    Conghui Hu, Da Li, Yongxin Yang, et al. 2020. Sketch-a-Segmenter: Sketch-Based Photo Segmenter Generation. IEEE TIP (2020)

  29. [29]

    Drew A. Hudson, Daniel Zoran, et al. 2024. SODA: Bottleneck Diffusion Models for Representation Learning. In CVPR

  30. [30]

    Bahjat Kawar, Shiran Zada, Oran Lang, et al. 2023. Imagic: Text-Based Real Image Editing with Diffusion Models. In CVPR

  31. [31]

    Roman Klokov and Victor Lempitsky. 2017. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE international conference on computer vision. 863–872

  32. [32]

    Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. 2024. Text-to-image diffusion models are great sketch-photo matchmakers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16826–16837

  33. [33]

    Subhadeep Koley, Tapas Kumar Dutta, Aneeshan Sain, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, and Yi-Zhe Song. 2025. SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 2556–2567

  34. [34]

    Y. Lei, Z. Zhou, P. Zhang, P. Guo, Z. Ma, and L. Liu. 2019. Deep Point-to-Subspace Metric Learning for Sketch-Based 3D Shape Retrieval. Pattern Recognition 96 (2019), 106–116

  35. [35]

    Bo Li, Yijuan Lu, Afzal Godil, Thomas Schreck, et al. 2014. A Comparison of Methods for Sketch-Based 3D Shape Retrieval. Computer Vision and Image Understanding 119, 6 (2014), 57–80

  36. [36]

    Bo Li, Yijuan Lu, Afzal Godil, Thomas Schreck, Makoto Aono, H. Johan, J. Saavedra, and S. Tashiro. 2013. SHREC’13 Track: Large Scale Sketch-Based 3D Shape Retrieval. In Eurographics Workshop on 3D Object Retrieval. 89–96

  37. [37]

    Bo Li, Yijuan Lu, Chen Li, Afzal Godil, et al. 2014. SHREC’14 Track: Extended Large Scale Sketch-Based 3D Shape Retrieval. In Eurographics Workshop on 3D Object Retrieval. 121–130

  38. [38]

    Junnan Li, Dongxu Li, Caiming Xiong, et al. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning. 12888–12900

  39. [39]

    Xue Li, Jiong Yu, Ziyang Li, Hongchun Lu, and Ruifeng Yuan. 2024. Dr. clip: Clip-driven universal framework for zero-shot sketch image retrieval. In Proceedings of the 32nd ACM international conference on multimedia. 9554–9562

  40. [40]

    Shuang Liang, Weidong Dai, Yiyang Cai, and Chi Xie. 2024. Sketch-based 3D shape retrieval via teacher–student learning. Computer Vision and Image Understanding 239 (2024), 103903

  41. [41]

    Shuang Liang, Weidong Dai, and Yichen Wei. 2021. Uncertainty learning for noise resistant sketch-based 3D shape retrieval. IEEE Transactions on Image Processing 30 (2021), 8632–8643

  42. [42]

    Fengyin Lin, Mingkang Li, Da Li, Timothy Hospedales, Yi-Zhe Song, and Yonggang Qi. 2023. Zero-shot everything sketch-based image retrieval, and in explainable style. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23349–23358

  43. [43]

    Daniel Maturana and Sebastian Scherer. 2015. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 922–928

  44. [44]

    Min Meng, Wenhang Chen, Jigang Liu, Jun Yu, and Jigang Wu. 2025. CoDi: Contrastive Disentanglement Generative Adversarial Networks for Zero-Shot Sketch-Based 3D Shape Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 35, 2 (2025), 1910–1920. doi:10.1109/TCSVT.2024.3472036

  45. [45]

    Anran Qi, Yulia Gryaditskaya, Jeifei Song, Yongxin Yang, Yonggang Qi, Timothy M. Hospedales, Tao Xiang, and Yi-Zhe Song. 2021. Toward Fine-Grained Sketch-Based 3D Shape Retrieval. IEEE Transactio...

    (Source page header: Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval. SIGIR ’26, June 03–05, 2026, Melbourne, Australia.)

  46. [46]

    Anran Qi, Yi-Zhe Song, and Tao Xiang. 2018. Semantic Embedding for Sketch-Based 3D Shape Retrieval. In British Machine Vision Conference, Vol. 3. 11–12

  47. [47]

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 652–660

  48. [48]

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30 (2017)

  49. [49]

    Jie Qin, Shuaihang Yuan, Jiaxin Chen, Boulbaba Ben Amor, Yi Fang, Nhat Hoang-Xuan, Chi-Bien Chu, Khoi-Nguyen Nguyen-Ngoc, Thien-Tri Cao, Nhat-Khang Ngo, et al. 2022. SHREC’22 track: Sketch-based 3D shape retrieval in the wild. Computers & Graphics 107 (2022), 104–115

  50. [50]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR

  51. [51]

    Nataniel Ruiz, Yuanzhen Li, et al. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In CVPR

  52. [52]

    J. M. Saavedra, B. Bustos, T. Schreck, S. M. Yoon, and M. Scherer. 2012. Sketch-Based 3D Model Retrieval Using Keyshapes for Global and Local Representation. In 3D Object Retrieval Workshop at Eurographics. 47–50

  53. [53]

    Aneeshan Sain et al. 2023. SD-PL: Diffusion Models for Sketch-Based Image Retrieval. In CVPR

  54. [54]

    Aneeshan Sain, Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Subhadeep Koley, Tao Xiang, and Yi-Zhe Song. 2023. Clip for all things zero-shot sketch-based image retrieval, fine-grained or not. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2765–2775

  55. [55]

    Aneeshan Sain, Ayan Kumar Bhunia, Vaishnav Potlapalli, Pinaki Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. 2022. Sketch3t: Test-time training for zero-shot sbir. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7462–7471

  56. [56]

    Mainak Singha, Ankit Jha, Divyam Gupta, Pranav Singla, and Biplab Banerjee

  57. [57]

    Elevating all zero-shot sketch-based image retrieval through multimodal prompt learning. In European Conference on Computer Vision. Springer, 1–19

  58. [58]

    Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision. 945–953

  59. [59]

    Yawen Su, Jing Bai, and Gan Lin. 2025. DKD 2 L: Dual Knowledge Distillation Dynamic Learning for sketch-based 3D shape retrieval. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5

  60. [60]

    Yawen Su, Wenjing Li, Jing Bai, and Gan Lin. 2025. SKD-SBSR: Structural Knowledge Distillation for Sketch-Based 3D Shape Retrieval. Knowledge-Based Systems 310 (2025), 112891

  61. [61]

    Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. 2020. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6398–6407

  62. [62]

    Jialin Tian, Xing Xu, Zheng Wang, Fumin Shen, and Xin Liu. 2021. Relationship-preserving knowledge distillation for zero-shot sketch based image retrieval. In Proceedings of the 29th ACM international conference on multimedia. 5473–5481

  63. [63]

    Bingrui Wang and Yuan Zhou. 2023. Doodle to Object: Practical Zero-Shot Sketch-Based 3D Shape Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 2474–2482

  64. [64]

    Fang Wang, Le Kang, and Yi Li. 2015. Sketch-based 3D shape retrieval using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1875–1883

  65. [65]

    Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions On Graphics (TOG) 36, 4 (2017), 1–11

  66. [66]

    Xinyu Wang et al. 2023. Test-Time Adaptation for Diffusion Models. In ICCV

  67. [67]

    Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. Proceedings of the European conference on computer vision (ECCV). In Proceedings of the European conference on computer vision (ECCV), Vol. 3. 8

  68. [68]

    Jin Xie, Guoxian Dai, Fan Zhu, Edward K. Wong, and Yi Fang. 2016. Deepshape: Deep-learned shape descriptor for 3D shape retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 7 (2016)

  69. [69]

    R. Xu, Z. Han, L. Hui, J. Qian, and J. Xie. 2022. Domain Disentangled Generative Adversarial Network for Zero-Shot Sketch-Based 3D Shape Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 2902–2910

  70. [70]

    Yongzhe Xu, Jiangchuan Hu, Kanoksak Wattanachote, Kun Zeng, and YongYi Gong. 2020. Sketch-based shape retrieval via best view selection and a cross-domain similarity measure. IEEE Transactions on Multimedia 22, 11 (2020), 2950–2962

  71. [71]

    Sang Min Yoon, Maximilian Scherer, Tobias Schreck, and Arjan Kuijper. 2010. Sketch-based 3D model retrieval using diffusion tensor fields of suggestive contours. In Proceedings of the 18th ACM international conference on Multimedia. 193–200

  72. [72]

    Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M Hospedales, and Chen-Change Loy. 2016. Sketch me that shoe. In Proceedings of the IEEE conference on computer vision and pattern recognition. 799–807

  73. [73]

    Shuaihang Yuan, Congcong Wen, Yu-Shen Liu, and Yi Fang. 2023. Retrieval-specific view learning for sketch-to-shape retrieval. IEEE Transactions on Multimedia 27 (2023), 768–779

  74. [74]

    Long Zeng, Zhi-kai Dong, Jia-yi Yu, Jun Hong, and Hong-yu Wang. 2019. Sketch-based retrieval and instantiation of parametric parts. Computer-Aided Design 113 (2019), 82–95

  75. [75]

    Donglin Zhang, Changxing Li, and Xiao-Jun Wu. 2025. Multi-level Encoding with Hierarchical Alignment for Sketch-Based 3D Shape Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1033–1043

  76. [76]

    Y. Zhao, Q. Liang, R. Ma, W. Nie, and Y. Su. 2022. JFLN: Joint Feature Learning Network for 2D Sketch Based 3D Shape Retrieval. Journal of Visual Communication and Image Representation 89 (2022), 103668

  77. [77]

    Wen Zhou, Jinyuan Jia, Wenying Jiang, and Chenxi Huang. 2020. Sketch augmentation-driven shape retrieval learning framework based on convolutional neural networks. IEEE transactions on visualization and computer graphics 27, 8 (2020), 3558–3570

  78. [78]

    Cunjuan Zhu, Dongdong Cui, Qi Jia, Weimin Wang, Yu Liu, and Michael S Lew

  79. [79]

    Sketch-based 3d shape retrieval with multi-view fusion transformer. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3005–3009