Text2CAD-Bench: A Benchmark for LLM-based Text-to-Parametric CAD Generation
Pith reviewed 2026-05-20 12:44 UTC · model grok-4.3
The pith
Text2CAD-Bench shows current LLMs handle basic CAD geometry but degrade sharply on complex topology and advanced features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Text2CAD-Bench is the first benchmark that systematically tests text-to-parametric CAD generation across geometric complexity and application domains. It supplies 600 examples divided into L1-L2 for standard features, L3 for complex topology and freeform surfaces, and L4 for real-world uses outside mechanical parts, each paired with dual-style prompts. Evaluation of general and domain-specific LLMs finds solid performance on basic geometry that falls substantially when models must manage advanced topology or non-standard domains.
What carries the argument
The four-level hierarchy of 600 examples, each carrying both geometric and procedural prompts, that measures how model accuracy changes with rising topological and domain complexity.
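To make that structure concrete, the following is a minimal sketch of how one benchmark record and a per-level success rate could be represented. It is an illustration only: the class name `BenchmarkExample`, its field names, the helper `success_rate_by_level`, and the toy prompts are all hypothetical and are not taken from the paper or its released data format.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four levels named in the paper; everything else here is hypothetical.
LEVELS = ("L1", "L2", "L3", "L4")


@dataclass(frozen=True)
class BenchmarkExample:
    example_id: str
    level: str               # one of LEVELS
    geometric_prompt: str    # casual, non-expert description
    procedural_prompt: str   # expert-style step sequence


def success_rate_by_level(examples, outcomes):
    """Return {level: fraction of successful examples}.

    `outcomes` maps example_id -> bool (True if the generated model passed).
    Levels with no examples are omitted from the result.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for ex in examples:
        if ex.level not in LEVELS:
            raise ValueError(f"unknown level: {ex.level}")
        totals[ex.level] += 1
        passed[ex.level] += int(outcomes[ex.example_id])
    return {lvl: passed[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


# Toy data, invented for illustration only.
examples = [
    BenchmarkExample("a", "L1", "A flat plate with a hole", "Sketch rectangle; extrude; cut hole"),
    BenchmarkExample("b", "L1", "A short cylinder", "Sketch circle; extrude"),
    BenchmarkExample("c", "L3", "A smoothly curved handle", "Loft between three profiles"),
    BenchmarkExample("d", "L3", "A twisted vase", "Sweep profile along helix"),
]
outcomes = {"a": True, "b": True, "c": True, "d": False}
rates = success_rate_by_level(examples, outcomes)
```

Keeping both prompt styles on the same record makes it straightforward to score the two styles separately on identical targets, which is the comparison the dual-prompt design is meant to support.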
If this is right
- Targeted model improvements are required for complex topology and freeform surface handling before text-to-CAD can support realistic design tasks.
- Expansion into L4-style non-mechanical domains becomes feasible only after the observed performance gaps close.
- Dual prompt styles enable separate measurement of how well models interpret casual user language versus precise procedural instructions.
- Public release of the benchmark supplies a shared testbed that can accelerate comparison and progress across text-to-CAD methods.
Where Pith is reading between the lines
- If models close the gap on L3 and L4, text-driven CAD could shorten iteration loops in product development by letting engineers describe changes in words rather than redraw sketches.
- The benchmark structure could be reused to create parallel tests for text-to-3D or text-to-manufacturing pipelines that face similar topology challenges.
- Fine-tuning on procedural prompt sequences might produce measurable gains on freeform cases, offering a concrete next experiment.
Load-bearing premise
The 600 human-curated examples and their four-level division accurately represent the distribution of challenges encountered in practical text-to-parametric CAD workflows.
What would settle it
A new collection of 200 industry-sourced CAD files, drawn independently of the original curation process, would settle it. If the same models showed no performance drop on that collection as complexity increased, the claim that the benchmark captures representative practical difficulty would be falsified.
Original abstract
Text-to-CAD generation aims to create parametric CAD models from natural language, enabling rapid prototyping and intuitive design workflows. However, existing benchmarks focus on basic primitives and simple sketch-extrude sequences, lacking advanced features essential for real-world applications and covering only traditional mechanical parts. We introduce Text2CAD-Bench, the first benchmark systematically evaluating text-to-CAD across geometric complexity and application diversity. Our benchmark comprises 600 human-curated examples spanning four levels: L1-L2 cover fundamental geometry with standard features, L3 introduces complex topology and freeform surfaces, and L4 extends to real-world domains beyond mechanical parts. Each example pairs dual-style prompts -- geometric descriptions mimicking non-expert users, and procedural sequences aligned with expert-level conventions. Evaluating mainstream general LLMs and domain-specific models, we find that current models perform reasonably on basic geometry but degrade substantially on complex topology and advanced features. We release our benchmark to drive progress in text-to-CAD research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Text2CAD-Bench, the first benchmark for evaluating LLM-based text-to-parametric CAD generation. It consists of 600 human-curated examples spanning four complexity levels (L1-L2: basic geometry and standard features; L3: complex topology and freeform surfaces; L4: real-world domains beyond mechanical parts), each with dual-style prompts (geometric descriptions for non-experts and procedural sequences for experts). Evaluations of general and domain-specific LLMs show reasonable performance on basic geometry but substantial degradation on complex topology and advanced features.
Significance. If the curation and evaluation methodology prove robust, the benchmark would fill a clear gap in existing CAD evaluation resources by incorporating advanced features and application diversity. The open release of the dataset would support reproducible progress in text-to-CAD research and help identify specific model limitations for complex parametric sequences.
major comments (3)
- [Benchmark Construction] Benchmark Construction section: the curation process for the 600 examples provides no details on selection criteria, inter-rater reliability among human curators, or validation steps, which directly affects the validity of the four-level division and the claim that L3-L4 accurately capture practical challenges.
- [Evaluation] Evaluation section: no statistical significance tests, confidence intervals, or variance measures are reported for the performance differences across levels, leaving the central claim of 'substantial degradation' on complex topology difficult to assess quantitatively.
- [Prompt Design] Prompt Design and L3-L4 examples: geometric natural-language prompts for freeform surfaces and non-manifold topology frequently under-constrain the target parametric sequence (multiple extrude orders or surface parameterizations can satisfy the same description), which risks confounding prompt ambiguity with intrinsic model limitations in the reported degradation.
minor comments (1)
- [Abstract] The abstract would benefit from including at least one concrete metric (e.g., success rate or average edit distance) to quantify the 'reasonable' vs. 'substantial degradation' findings.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating where revisions have been made to the manuscript.
- Referee: [Benchmark Construction] Benchmark Construction section: the curation process for the 600 examples provides no details on selection criteria, inter-rater reliability among human curators, or validation steps, which directly affects the validity of the four-level division and the claim that L3-L4 accurately capture practical challenges.
  Authors: We agree that additional details on the curation process would improve transparency and strengthen claims about the benchmark's validity. In the revised manuscript we have expanded the Benchmark Construction section with explicit selection criteria (coverage of geometric primitives, topological complexity metrics, and domain diversity), a description of the multi-expert curation workflow, and validation steps including expert cross-review for level assignment. Formal inter-rater reliability metrics were not computed during the original curation; we therefore describe the consensus process used instead of reporting statistics that were not collected.
  Revision: partial
- Referee: [Evaluation] Evaluation section: no statistical significance tests, confidence intervals, or variance measures are reported for the performance differences across levels, leaving the central claim of 'substantial degradation' on complex topology difficult to assess quantitatively.
  Authors: We concur that quantitative statistical support would make the degradation claim more robust. The revised manuscript now includes bootstrap-derived 95% confidence intervals on success rates per level and paired statistical tests (McNemar’s test) comparing performance across complexity levels. These results are reported in the Evaluation section and supporting figures.
  Revision: yes
- Referee: [Prompt Design] Prompt Design and L3-L4 examples: geometric natural-language prompts for freeform surfaces and non-manifold topology frequently under-constrain the target parametric sequence (multiple extrude orders or surface parameterizations can satisfy the same description), which risks confounding prompt ambiguity with intrinsic model limitations in the reported degradation.
  Authors: We recognize that natural-language prompts for complex topology can admit multiple valid parametric realizations. The dual-prompt design (geometric description paired with procedural sequence) was intended to reduce this ambiguity, and the observed performance drop is consistent across both prompt styles. In the revision we have added a dedicated discussion of prompt ambiguity, its potential confounding role, and the steps taken during curation to constrain prompts. We also include additional constrained prompt examples in the appendix.
  Revision: partial
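The statistical additions named in the second response (bootstrap 95% confidence intervals and McNemar's test) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names `bootstrap_ci` and `mcnemar_exact` are hypothetical, the interval uses the simple percentile method, and the McNemar sketch assumes the pairs are the same examples scored under two conditions (for instance two prompt styles), since the rebuttal does not say how pairs are formed.

```python
import random
from math import comb


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a success rate.

    `outcomes` is a list of booleans, one per benchmark example.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample examples with replacement and record each resampled rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for paired boolean outcomes.

    Only discordant pairs matter: under the null hypothesis each one is
    equally likely to favour either condition (binomial with p = 0.5).
    """
    a_only = sum(1 for x, y in zip(a, b) if x and not y)
    b_only = sum(1 for x, y in zip(a, b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired outcomes, invented for illustration only.
style_a = [True] * 5 + [True, False, True]
style_b = [False] * 5 + [True, False, True]
p_value = mcnemar_exact(style_a, style_b)
interval = bootstrap_ci(style_a)
```

The exact binomial form is used here because it needs no external library and stays valid when the number of discordant pairs is small, which is plausible for per-level subsets of a 600-example benchmark.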
Circularity Check
No circularity: benchmark introduces independent evaluation data
The paper introduces Text2CAD-Bench as a new human-curated dataset of 600 examples across four complexity levels and evaluates LLMs directly on it. No equations, fitted parameters, or derivations are present that reduce reported performance findings to self-referential inputs or prior self-citations by construction. The central claims rest on external benchmark creation and model testing rather than any closed loop of prediction equaling input.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-curated examples across four defined levels capture the essential challenges of real-world text-to-CAD tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our benchmark comprises 600 human-curated examples spanning four levels: L1-L2 cover fundamental geometry with standard features, L3 introduces complex topology and freeform surfaces, and L4 extends to real-world domains beyond mechanical parts.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We adopt three complementary metrics... Chamfer Distance (CD)... Invalidity Rate (IR)... Intersection over Union (IoU).
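The three metrics named in the quoted passage can be illustrated with a small sketch. The paper's exact definitions are not given in this review, so the choices below are assumptions: Chamfer Distance is computed as the sum of mean squared nearest-neighbour distances in both directions over sampled points, IoU is computed over sets of occupied voxel coordinates, and Invalidity Rate is the fraction of outputs flagged as invalid. All function names are hypothetical.

```python
import math


def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean squared nearest-neighbour distance,
    summed over both directions. Other conventions exist.
    """
    def one_way(a, b):
        return sum(min(math.dist(x, y) ** 2 for y in b) for x in a) / len(a)

    return one_way(p, q) + one_way(q, p)


def voxel_iou(a, b):
    """Intersection over Union of two collections of occupied voxel indices."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty shapes are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def invalidity_rate(valid_flags):
    """Fraction of generated models that failed to produce valid geometry."""
    return sum(1 for ok in valid_flags if not ok) / len(valid_flags)


# Toy inputs, invented for illustration only.
cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
iou = voxel_iou([(0, 0, 0), (1, 0, 0)], [(1, 0, 0), (2, 0, 0)])
ir = invalidity_rate([True, False, True, True])
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a demonstration but would normally be replaced by a spatial index for dense samples.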
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.