
arxiv: 2604.18019 · v1 · submitted 2026-04-20 · 💻 cs.CV

Multi-View Hierarchical Graph Neural Network for Sketch-Based 3D Shape Retrieval

Pith reviewed 2026-05-10 04:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords sketch-based 3D retrieval · multi-view graphs · graph neural networks · hierarchical coarsening · zero-shot retrieval · CLIP alignment · 3D shape matching

The pith

MV-HGNN captures geometric relationships across 3D views with hierarchical graph coarsening and CLIP alignment to improve sketch-based shape retrieval in both standard and zero-shot settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that treating multi-view 3D features as graphs, rather than independent encodings, produces stronger representations for matching hand-drawn sketches to 3D models. It builds view-level graphs to model local geometric dependencies and global cross-view relations, then applies a view selector for progressive coarsening that enlarges receptive fields while discarding redundant information. Projecting both sketch and shape features into CLIP text embedding space further allows the same architecture to operate without category-specific overfitting. A reader would care because prior aggregation methods lose inter-view structure and fail when sketches depict objects outside the training set. If the claim holds, retrieval systems could handle richer 3D detail and generalize to novel categories using one model.

Core claim

The Multi-View Hierarchical Graph Neural Network constructs a view-level graph and applies local graph convolution plus global attention to capture adjacent geometric dependencies and cross-view message passing. A view selector then performs hierarchical graph coarsening, which yields progressively larger receptive fields and more discriminative multi-level 3D representations. Finally, both sketch and 3D features are projected into a shared CLIP semantic space that uses text embeddings as prototypes. This enables a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval under the same architecture, which the experiments show outperforms prior methods on two public benchmarks.
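As an editorial illustration of the first step, the sketch below runs one round of local neighbourhood aggregation and one round of global attention over a set of per-view feature vectors. It is not the authors' implementation: the ring-shaped camera layout, the weight-free mean aggregation, the projection-free attention, and the additive fusion of the two branches are all assumptions made to keep the example short, and every function name is hypothetical.

```python
import math

def ring_adjacency(num_views, k=1):
    """Link each view to its k neighbours on either side of an assumed camera ring."""
    adj = [[0.0] * num_views for _ in range(num_views)]
    for i in range(num_views):
        adj[i][i] = 1.0  # self-loop keeps the view's own feature
        for step in range(1, k + 1):
            adj[i][(i + step) % num_views] = 1.0
            adj[i][(i - step) % num_views] = 1.0
    return adj

def local_graph_conv(features, adj):
    """Mean over each view's neighbourhood (stand-in for a learned graph convolution)."""
    dim = len(features[0])
    out = []
    for row in adj:
        degree = sum(row)
        out.append([sum(w * f[d] for w, f in zip(row, features)) / degree
                    for d in range(dim)])
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(features):
    """Scaled dot-product self-attention across all views, without learned projections."""
    dim = len(features[0])
    out = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, features))
                    for d in range(dim)])
    return out

def view_graph_layer(features, k=1):
    """Add the local and global branches (the additive fusion is an assumption)."""
    local = local_graph_conv(features, ring_adjacency(len(features), k))
    glob = global_attention(features)
    return [[l + g for l, g in zip(lv, gv)] for lv, gv in zip(local, glob)]
```

The local branch only sees adjacent cameras, while the attention branch lets every view exchange information with every other view, which is the division of labour the core claim describes.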

What carries the argument

View-level graph processed by local graph convolution combined with global attention, followed by hierarchical coarsening via a view selector and projection into CLIP semantic space.
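The coarsening step can be pictured as a score-and-keep operation in the style of top-k graph pooling. The sketch below is a hedged illustration only: the linear scoring vector `weight`, the sigmoid gating of kept views, the `keep_ratio` value, and the mean-pooled per-level descriptors are assumptions, not details confirmed by the paper.

```python
import math

def mean_pool(features):
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def coarsen(features, weight, keep_ratio=0.5):
    """Score each view with a (hypothetical) learned vector and keep the top fraction."""
    scores = [sum(w * x for w, x in zip(weight, f)) for f in features]
    keep = max(1, math.ceil(keep_ratio * len(features)))
    top = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # preserve the original view order
    # Gate kept features by sigmoid(score) so the scorer would receive gradients
    # in a trainable version of this sketch.
    gated = [[x / (1.0 + math.exp(-scores[i])) for x in features[i]] for i in kept]
    return kept, gated

def hierarchical_descriptors(features, weight, levels=2, keep_ratio=0.5):
    """Return one pooled descriptor per level, from all views down to the coarsest set."""
    current = features
    descriptors = [mean_pool(current)]
    for _ in range(levels):
        _, current = coarsen(current, weight, keep_ratio)
        descriptors.append(mean_pool(current))
    return descriptors
```

With four views and `keep_ratio=0.5`, two levels of coarsening reduce the view set from four to two to one, giving three descriptors of increasing abstraction.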

If this is right

  • Improved retrieval accuracy for sketches against 3D shapes when categories are known in advance.
  • Effective retrieval when sketches belong to object categories absent from training data.
  • Hierarchical representations that preserve both fine local details and broader structural context from multiple views.
  • Reduced impact of uninformative or redundant viewpoints on the final 3D descriptor.
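The known-category and unseen-category cases above both reduce to comparing vectors in one shared space. A minimal sketch, assuming cosine similarity, a softmax cross-entropy against class prototypes, and a temperature of 0.07 (all assumptions, and the vectors here are placeholders rather than real CLIP text embeddings):

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against a zero vector
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def prototype_loss(feature, prototypes, label, temperature=0.07):
    """Cross-entropy of a sketch or shape feature against per-class text prototypes."""
    logits = [cosine(feature, p) / temperature for p in prototypes]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]

def rank_shapes(sketch_feature, shape_features):
    """Return gallery indices sorted from most to least similar to the sketch."""
    sims = [cosine(sketch_feature, s) for s in shape_features]
    return sorted(range(len(shape_features)), key=lambda i: sims[i], reverse=True)
```

Because ranking needs only the projected features, a category unseen in training can still be retrieved as long as its sketch and shape features land near each other in the shared space.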

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same view-graph plus selector pattern could be tested on other multi-observation 3D tasks such as video-based object classification.
  • CLIP alignment opens the possibility of mixing sketch queries with text descriptions in a single retrieval index.
  • The learned view-selection weights may indicate which camera angles are most informative for sketch-to-shape matching, offering a diagnostic for viewpoint importance.

Load-bearing premise

That modeling view relationships through graph convolution and attention, selecting important views hierarchically, and aligning to CLIP prototypes will produce 3D features discriminative enough to beat simpler aggregation methods without overfitting to seen categories.

What would settle it

Direct head-to-head retrieval metrics on the two public benchmarks showing that MV-HGNN does not exceed the accuracy of prior multi-view aggregation methods under either the category-level or zero-shot protocol.
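A head-to-head comparison of this kind comes down to scoring ranked retrieval lists. The sketch below computes nearest-neighbour accuracy and mean average precision, two metrics commonly used in retrieval work; whether the paper reports exactly these is an assumption, and the function names are illustrative.

```python
def nearest_neighbor_hit(ranked_labels, query_label):
    """1.0 if the top-ranked shape shares the query's category, else 0.0."""
    return 1.0 if ranked_labels and ranked_labels[0] == query_label else 0.0

def average_precision(ranked_labels, query_label):
    """Mean of precision values taken at each rank where a relevant shape appears."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(all_ranked_labels, query_labels):
    """Average the per-query AP over a whole set of sketch queries."""
    aps = [average_precision(r, q) for r, q in zip(all_ranked_labels, query_labels)]
    return sum(aps) / len(aps)
```

Running both methods through the same functions on the same query set, under each protocol, is what makes the comparison direct.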

Figures

Figures reproduced from arXiv: 2604.18019 by Chengfeng Xie, Hang Cheng, Long Zeng, Mingyu Fan, Muyan He, Xi Cheng.

Figure 1: Existing method and our method.
Figure 2: Overview of the proposed MV-HGNN framework. Left: Cross-modal alignment for 3D shape retrieval based on sketches. …
Figure 3: Illustration of the training strategies under …
Figure 4: Examples of view selection results on 3D shapes …
Figure 5: The frequency of intra-class and inter-class dis…
Figure 6: t-SNE Visualization of test samples on the …
Figure 7: Retrieval examples for sketch-based 3D shape re…
Original abstract

Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes that are consistent with the category of the input hand-drawn sketch. The core challenge of this task lies in two aspects: existing methods typically employ simplified aggregation strategies for independently encoded 3D multi-view features, which ignore the geometric relationships between views and multi-level details, resulting in weak 3D representation. Simultaneously, traditional SBSR methods are constrained by visible category limitations, leading to poor performance in zero-shot scenarios. To address these challenges, we propose Multi-View Hierarchical Graph Neural Network (MV-HGNN), a novel framework for SBSR. Specifically, we construct a view-level graph and capture adjacent geometric dependencies and cross-view message passing via local graph convolution and global attention. A view selector is further introduced to perform hierarchical graph coarsening, enabling a progressively larger receptive field for graph convolution and mitigating the interference of redundant views, which leads to more discriminate discriminative hierarchical 3D representation. To enable category agnostic alignment and mitigate overfitting to seen classes, we leverage CLIP text embeddings as semantic prototypes and project both sketch and 3D features into a shared semantic space. We use a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval under the same model architecture. Under both category-level and zero-shot settings, extensive experiments on two public benchmarks demonstrate that MV-HGNN outperforms state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes MV-HGNN, a multi-view hierarchical graph neural network for sketch-based 3D shape retrieval. It builds a view-level graph to model geometric relationships among multi-view 3D features via local graph convolution and global attention, applies a view selector for hierarchical coarsening to enlarge receptive fields while reducing redundant views, and projects both sketch and 3D features into a shared CLIP semantic space for category-agnostic alignment. Separate training protocols (two-stage for category-level, one-stage for zero-shot) are used under the same architecture, with claims of outperformance over state-of-the-art methods on two public benchmarks under both settings.

Significance. If the reported gains hold under rigorous evaluation, the work would advance SBSR by replacing simplified multi-view aggregation with explicit graph-based modeling of view dependencies and hierarchical coarsening, while the CLIP-based projection offers a practical route to zero-shot generalization. The architecture's combination of local/global message passing and progressive coarsening provides a concrete, reproducible template for learning multi-level 3D representations from sketches that could transfer to related cross-modal retrieval tasks.

major comments (2)
  1. [§4] §4 (Experiments): The central claim of consistent outperformance in both category-level and zero-shot regimes rests on benchmark results, yet the manuscript supplies no statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) across multiple runs, nor does it report variance or confidence intervals for the reported metrics; without these, it is impossible to determine whether the observed margins over baselines are reliable or could arise from random seed variation.
  2. [§3.3] §3.3 (View Selector): The hierarchical coarsening step is load-bearing for the claim of 'more discriminative hierarchical 3D representation,' but the description does not specify whether the view selector is trained end-to-end with a dedicated loss or via a separate pre-training stage; if the selector parameters are not jointly optimized with the downstream retrieval objective, the progressive receptive-field benefit may not materialize and the architecture reduces to standard multi-view pooling.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'more discriminate discriminative hierarchical 3D representation' contains a repeated word; replace with 'more discriminative hierarchical 3D representation'.
  2. [§2] §2 (Related Work): The discussion of prior multi-view aggregation methods would benefit from an explicit comparison table listing the aggregation strategy (mean/max/attention) and whether geometric view relationships are modeled; this would clarify the precise novelty of the local-convolution-plus-global-attention design.
  3. [Figure 2] Figure 2: The diagram of the view-level graph construction and coarsening stages is difficult to follow because edge weights and the coarsening ratio are not annotated on the figure; adding these labels would improve readability.
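For reference, the three aggregation strategies named in minor comment 2 can be sketched in a few lines. None of them uses any relationship between views beyond a per-view weight, which is the gap a graph-based design is meant to fill. The dot-product scoring against a `query` vector in `attention_pool` is an assumed, simplified form of attention pooling.

```python
import math

def mean_pool(views):
    dim = len(views[0])
    return [sum(v[d] for v in views) / len(views) for d in range(dim)]

def max_pool(views):
    dim = len(views[0])
    return [max(v[d] for v in views) for d in range(dim)]

def attention_pool(views, query):
    """Weight each view by softmax(view . query), then take the weighted sum."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in views]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(views[0])
    return [sum(w * v[d] for w, v in zip(weights, views)) for d in range(dim)]
```

With a zero query vector the attention weights are uniform, so `attention_pool` falls back to `mean_pool`; a query aligned with one view makes it approach that single view's feature.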

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additions.

  1. Referee: [§4] §4 (Experiments): The central claim of consistent outperformance in both category-level and zero-shot regimes rests on benchmark results, yet the manuscript supplies no statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) across multiple runs, nor does it report variance or confidence intervals for the reported metrics; without these, it is impossible to determine whether the observed margins over baselines are reliable or could arise from random seed variation.

    Authors: We agree that statistical analysis would further strengthen the claims. The original submission reported single-run results, but the performance margins were observed to be stable across preliminary checks with different seeds. In the revised manuscript we will rerun all experiments with at least five random seeds, report mean and standard deviation for every metric, and include paired t-tests (or Wilcoxon signed-rank tests where appropriate) against the strongest baselines to confirm statistical significance of the improvements. revision: yes

  2. Referee: [§3.3] §3.3 (View Selector): The hierarchical coarsening step is load-bearing for the claim of 'more discriminative hierarchical 3D representation,' but the description does not specify whether the view selector is trained end-to-end with a dedicated loss or via a separate pre-training stage; if the selector parameters are not jointly optimized with the downstream retrieval objective, the progressive receptive-field benefit may not materialize and the architecture reduces to standard multi-view pooling.

    Authors: We apologize for the ambiguity in the original description. The view selector is trained end-to-end jointly with the rest of the MV-HGNN (graph convolutions, attention, and projection layers) using the same retrieval objective; no separate pre-training stage or dedicated loss is employed. This joint optimization ensures the coarsening decisions directly improve the final cross-modal alignment. We will insert an explicit statement in §3.3 of the revised manuscript clarifying the end-to-end training procedure. revision: yes
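The multi-seed reporting promised in response 1 can be sketched with the standard library alone. The functions below compute a mean and sample standard deviation per method and a paired t statistic over per-seed scores. Any numbers passed in are the caller's own; the toy values in the usage note are made up for illustration and are not results from the paper.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation of one metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(method_a, method_b):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (spread / math.sqrt(n)), n - 1
```

For example, `paired_t_statistic([2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 1.0, 3.0, 3.0, 5.0])` pairs five toy scores seed by seed; the resulting statistic would then be compared against a t distribution with four degrees of freedom to obtain a p-value.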

Circularity Check

0 steps flagged

No significant circularity


The paper proposes an empirical neural architecture (view-level graph with local convolution plus global attention, hierarchical coarsening via view selector, and CLIP semantic projection) and reports its performance on external benchmarks under category-level and zero-shot protocols. No mathematical derivation, first-principles prediction, or parameter-fitting step is described that reduces to its own inputs by construction. Claims rest on experimental evaluation rather than self-definitional equations, fitted-input predictions, or load-bearing self-citations. The architecture is presented as a design choice evaluated externally, with no internal reduction of results to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, mathematical axioms, or postulated physical entities appear in the abstract; the framework relies on standard graph convolution operations and pre-trained CLIP embeddings.

pith-pipeline@v0.9.0 · 5571 in / 1132 out tokens · 41618 ms · 2026-05-10T04:56:15.745903+00:00 · methodology


Multi-View Hierarchical Graph Neural Network for Sketch-Based 3D Shape Retrieval. SIGIR ’26, June 03–05, 2026, Melbourne, Australia.