Text-to-CAD Retrieval: A Strong Baseline
Pith reviewed 2026-05-08 15:04 UTC · model grok-4.3
The pith
A multi-modal framework aligns text queries with CAD procedural sequences and point clouds to retrieve relevant models from large databases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a unified framework learns multi-modal CAD embeddings from procedural sequences and geometric point clouds. During training, a novel feature decoder reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit alignment, so that at inference time the concatenated sequence-point features support accurate text-based retrieval.
What carries the argument
The feature decoder that reconstructs masked sequence features via cross-attention with text and point features to produce aligned multi-modal embeddings.
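The decoder's masked-reconstruction idea can be sketched in a few lines. Everything below (single-head scaled dot-product attention, zeroed placeholder queries for masked slots, a mean-squared reconstruction loss) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def reconstruct_masked(seq_feats, mask, text_feats, point_feats):
    """Zero out masked sequence features (placeholder queries) and
    reconstruct them by attending over the text and point features."""
    context = np.concatenate([text_feats, point_feats], axis=0)
    queries = np.where(mask[:, None], 0.0, seq_feats)
    recon = cross_attention(queries, context, context)
    # Reconstruction loss is computed only on the masked positions,
    # so text and point features must carry the missing information.
    loss = np.mean((recon[mask] - seq_feats[mask]) ** 2)
    return recon, loss
```

The point of the sketch is the training signal: minimizing the loss forces text and point features to be predictive of the hidden sequence features, which is the "implicit alignment" the review describes.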
If this is right
- The approach supplies a concrete benchmark and baseline for measuring progress on text-to-CAD retrieval.
- Efficient inference becomes possible by dropping the decoder and using only the concatenated sequence and point features.
- The framework can serve as a starting point for retrieval-augmented generation of new CAD models.
- Combining construction logic from sequences with explicit geometry from points improves matching over using either alone.
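The inference path sketched above (drop the decoder, concatenate sequence and point features, compare against the text query) might look as follows. The L2-normalization, cosine similarity, and the assumption that the text embedding has already been projected into the joint space are all guesses; the abstract does not specify the comparison metric:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve(text_emb, seq_feats, point_feats, top_k=5):
    """Rank database entries by cosine similarity between the text query
    embedding and concatenated sequence-point embeddings (decoder dropped)."""
    cad_emb = l2_normalize(np.concatenate([seq_feats, point_feats], axis=1))
    query = l2_normalize(text_emb)
    sims = cad_emb @ query          # cosine similarity per database entry
    order = np.argsort(-sims)       # descending similarity
    return order[:top_k], sims[order[:top_k]]
```

Because the decoder is only a training-time scaffold, retrieval reduces to a single matrix-vector product over precomputed embeddings, which is what makes the efficiency claim plausible.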
Where Pith is reading between the lines
- Design software could embed this retrieval step to suggest similar legacy models while a user is building a new part.
- The same alignment pattern might transfer to text-based search in other procedural modeling domains such as architecture or manufacturing.
- Scaling the encoders to larger CAD collections could test whether the alignment remains stable as database size grows.
Load-bearing premise
The paired text descriptions in the dataset must accurately reflect the semantics of the CAD models; otherwise the learned alignment will not produce embeddings that generalize to queries outside the training set.
What would settle it
If the method's retrieval accuracy on the Text2CAD test split falls below that of a simple keyword-matching baseline on the same data, the multi-modal alignment would not deliver the claimed advantage.
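The keyword-matching baseline invoked in this falsification test is easy to pin down concretely. The scoring rule and Recall@K metric below are hypothetical illustrations, not anything specified in the paper:

```python
def keyword_score(query, description):
    """Fraction of query tokens that appear in a model's text description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / max(len(q), 1)

def rank_by_keywords(query, corpus):
    """corpus: {model_id: description}. Ids sorted by keyword overlap."""
    return sorted(corpus, key=lambda m: keyword_score(query, corpus[m]),
                  reverse=True)

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the ground-truth model appears in the top-k results, else 0.0."""
    return float(relevant_id in ranked_ids[:k])
```

Averaging `recall_at_k` over a test split gives the Recall@K number that the multi-modal method would need to beat for the claimed advantage to hold.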
Original abstract
Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces text-to-CAD retrieval as a new cross-modal task for searching CAD models using natural language queries. It proposes a framework with a sequence encoder for procedural construction logic, a point encoder for geometric features, and a text encoder, trained via a feature decoder that reconstructs masked sequence features through cross-attention with text and point features to achieve implicit multi-modal alignment. At inference, the decoder is removed and concatenated sequence-point embeddings are used for retrieval. The work claims this serves as a strong baseline and foundation for downstream tasks like retrieval-augmented CAD generation, using the Text2CAD dataset.
Significance. If the empirical claims hold, the work would establish an initial benchmark and practical method for text-based retrieval in CAD repositories, addressing limitations of filename-based search in industrial design reuse. The dual use of procedural sequences and point clouds is a domain-appropriate choice for CAD representations that could support future retrieval-augmented generation pipelines.
major comments (3)
- [Abstract] The central claim that the framework 'serves as a strong baseline' for text-to-CAD retrieval is unsupported because the manuscript provides no quantitative retrieval metrics (e.g., Recall@K, mAP), ablation studies, baseline comparisons, or error analysis on the Text2CAD dataset.
- [Training description, feature decoder] The masked sequence reconstruction objective via cross-attention with text and point features is an indirect alignment signal that does not explicitly enforce metric similarity between text embeddings and the final concatenated sequence-point vectors used at inference. This leaves open the possibility that reconstruction succeeds without learning retrieval-competitive, semantically discriminative features.
- [Method overview] No details are given on how the concatenated sequence-point features are normalized or compared to text embeddings during retrieval (e.g., cosine similarity, learned projection), nor on any contrastive or ranking loss that would directly optimize the inference-time metric.
minor comments (1)
- [Abstract] The abstract states that source code will be released but provides no link or repository information in the manuscript.
Simulated Author's Rebuttal
We thank the referee for their detailed review and valuable feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
Point-by-point responses
Referee: [Abstract] The central claim that the framework 'serves as a strong baseline' for text-to-CAD retrieval is unsupported because the manuscript provides no quantitative retrieval metrics (e.g., Recall@K, mAP), ablation studies, baseline comparisons, or error analysis on the Text2CAD dataset.
Authors: We acknowledge that the claim of serving as a 'strong baseline' requires empirical validation through quantitative metrics. The current manuscript focuses on introducing the task and the framework, but to substantiate this claim, we will add comprehensive experiments including Recall@K and mAP scores, ablation studies on the components (sequence encoder, point encoder, decoder), comparisons to simple baselines such as text-only or point-only retrieval, and error analysis on the Text2CAD dataset in the revised version. revision: yes
Referee: [Training description, feature decoder] The masked sequence reconstruction objective via cross-attention with text and point features is an indirect alignment signal that does not explicitly enforce metric similarity between text embeddings and the final concatenated sequence-point vectors used at inference. This leaves open the possibility that reconstruction succeeds without learning retrieval-competitive, semantically discriminative features.
Authors: The referee correctly points out that the alignment is implicit through the reconstruction task. While the cross-attention mechanism requires the text and point features to be informative for reconstructing the masked sequence features, thereby encouraging semantic alignment, we agree that this is indirect. To strengthen the approach, we will incorporate a direct contrastive loss between the text embeddings and the concatenated sequence-point embeddings during training to explicitly optimize for the retrieval metric. We will also provide analysis demonstrating that the reconstruction objective leads to discriminative features suitable for retrieval. revision: yes
Referee: [Method overview] No details are given on how the concatenated sequence-point features are normalized or compared to text embeddings during retrieval (e.g., cosine similarity, learned projection), nor on any contrastive or ranking loss that would directly optimize the inference-time metric.
Authors: We apologize for the omission of these implementation details. In the revised manuscript, we will specify that the concatenated sequence-point features and text embeddings are L2-normalized and compared using cosine similarity for retrieval. Additionally, we will introduce a contrastive loss to directly optimize this similarity metric during training, as noted in our response to the training description comment. revision: yes
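The contrastive loss the authors promise in these responses is presumably an InfoNCE-style objective between text embeddings and concatenated sequence-point embeddings. A minimal numpy sketch, under assumed choices (symmetric loss, temperature 0.07, in-batch negatives), not the authors' eventual implementation:

```python
import numpy as np

def info_nce(text_emb, cad_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.
    Row i of text_emb is assumed to match row i of cad_emb;
    all other rows in the batch serve as negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    c = cad_emb / np.linalg.norm(cad_emb, axis=1, keepdims=True)
    logits = t @ c.T / temperature
    labels = np.arange(len(t))

    def xent(lg):
        # cross-entropy of each row's softmax against the diagonal label
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss directly optimizes the cosine-similarity ranking used at inference, which is exactly the gap the referee's second comment identifies.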
Circularity Check
No circularity detected: reconstruction objective independent of retrieval metric
Full rationale
The paper defines a new cross-modal retrieval task and proposes an architecture with separate sequence/point/text encoders plus an auxiliary feature decoder trained on masked sequence reconstruction via cross-attention. This training signal is distinct from the inference-time retrieval procedure that simply concatenates sequence and point features for similarity search against text embeddings. No equations, fitted parameters, or self-citations are shown that would make the claimed retrieval performance equivalent to the inputs by construction. The derivation chain remains self-contained with independent training and evaluation components.