Recognition: 2 theorem links
· Lean TheoremText-to-CAD Evaluation with CADTests
Pith reviewed 2026-05-11 02:12 UTC · model grok-4.3
The pith
CADTestBench uses executable tests to evaluate and guide Text-to-CAD model generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CADTestBench is the first test-based benchmark for Text-to-CAD, built on CADTests that execute checks for geometric and topological compliance with the input prompt; the same tests can also be used to guide model generation and yield baselines that surpass recent methods.
What carries the argument
CADTests: executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements implied by the text prompt.
If this is right
- Existing Text-to-CAD methods can be ranked by the fraction of CADTests they satisfy on a fixed prompt set.
- Generation pipelines can incorporate CADTest feedback at inference time to improve output quality.
- Training or fine-tuning objectives can be defined directly from test pass rates rather than indirect similarity scores.
- New methods can be developed by searching for models that maximize the number of satisfied CADTests.
Where Pith is reading between the lines
- The testing approach could extend to other structured generation tasks such as text-to-3D or parametric design where precise constraint satisfaction is required.
- Integrating CADTests into a reinforcement-learning loop might produce models that satisfy constraints more reliably than current supervised approaches.
- The benchmark data could be used to identify systematic failure modes in current generators, such as topology errors that visual metrics overlook.
Load-bearing premise
CADTests accurately and completely capture every geometric and topological requirement that an arbitrary text prompt implies, without false passes or false failures.
What would settle it
A text prompt and generated model pair where the model passes all applicable CADTests yet fails to match the prompt's intended shape or topology, or where a matching model fails the tests.
Figures
read the original abstract
Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests, executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass performance of current methods. CADTestBench code and data are available at GitHub and Hugging Face dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests—executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input text prompt. Using this benchmark, the authors conduct comprehensive evaluation of recent Text-to-CAD methods and demonstrate that CADTests can also guide CAD model generation, producing simple baselines that surpass the performance of current methods. The code and data are released publicly via GitHub and Hugging Face.
Significance. If the CADTests are shown to be valid and complete, the work provides an objective, automated evaluation framework for an emerging task where assessment has been difficult, potentially standardizing comparisons and enabling better generation via test guidance. The public release of resources is a clear strength that could accelerate progress in the field.
major comments (3)
- [Abstract] Abstract: The claims of 'comprehensive benchmarking' of recent methods and that 'guided baselines surpass performance of current methods' are asserted without any quantitative results, metrics, baseline details, or controls provided. This leaves the central empirical claims unverifiable and unsupported in the available text.
- [CADTest construction (likely §3)] CADTest construction (likely §3): No details are given on the automated test generation process from arbitrary text prompts, nor is there any independent validation (e.g., human evaluation, inter-annotator agreement, or held-out test set) to confirm that CADTests faithfully capture all implied geometric and topological constraints without false positives or negatives. This is load-bearing, as both the benchmarking results and the guidance superiority claim rest on test correctness.
- [Experiments and guidance sections (likely §4-5)] Experiments and guidance sections (likely §4-5): The assertion that CADTests can guide generation to outperform prior methods requires explicit description of the guidance mechanism, the 'simple baselines,' the evaluation metrics, and controls for test validity. Without these, and given the risk of incomplete tests for non-trivial prompts, the superiority cannot be established.
minor comments (1)
- [Abstract] Abstract: The statement that 'CADTestBench code and data are available at GitHub and Hugging Face dataset' lacks specific repository names, links, or DOIs.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. We appreciate the acknowledgment of CADTestBench's potential to standardize evaluation in Text-to-CAD and the value of the public resource release. Below we respond point-by-point to the major comments and commit to revisions that will strengthen the clarity, verifiability, and completeness of the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claims of 'comprehensive benchmarking' of recent methods and that 'guided baselines surpass performance of current methods' are asserted without any quantitative results, metrics, baseline details, or controls provided. This leaves the central empirical claims unverifiable and unsupported in the available text.
Authors: The abstract is deliberately concise and therefore omits specific numerical results. The full manuscript reports quantitative benchmarking results, pass-rate metrics, baseline specifications, and experimental controls in Sections 4 and 5. To address the concern, we will revise the abstract to include a brief summary of the key quantitative findings (e.g., overall pass rates and relative improvements of the guided baselines). revision: yes
-
Referee: [CADTest construction (likely §3)] CADTest construction (likely §3): No details are given on the automated test generation process from arbitrary text prompts, nor is there any independent validation (e.g., human evaluation, inter-annotator agreement, or held-out test set) to confirm that CADTests faithfully capture all implied geometric and topological constraints without false positives or negatives. This is load-bearing, as both the benchmarking results and the guidance superiority claim rest on test correctness.
Authors: Section 3 describes the automated CADTest generation pipeline, which parses text prompts to extract geometric and topological constraints and emits executable test code. We agree that additional explicit detail and independent validation would strengthen the work. We will expand the section with a precise algorithmic description of the generation process and will add a new validation subsection reporting human evaluation results on a held-out sample of prompts, including inter-annotator agreement statistics and an analysis of false-positive/negative rates. revision: yes
-
Referee: [Experiments and guidance sections (likely §4-5)] Experiments and guidance sections (likely §4-5): The assertion that CADTests can guide generation to outperform prior methods requires explicit description of the guidance mechanism, the 'simple baselines,' the evaluation metrics, and controls for test validity. Without these, and given the risk of incomplete tests for non-trivial prompts, the superiority cannot be established.
Authors: Sections 4 and 5 present the experimental protocol, the guidance procedure (CADTests used for iterative verification and selection), the simple baselines (lightweight LLM-based generators augmented by test-driven filtering), and the primary metric (CADTest pass rate). We acknowledge that these descriptions can be made more explicit. We will add pseudocode for the guidance loop, full baseline configurations, ablation studies that isolate the contribution of test guidance, and a limitations paragraph that directly discusses the risk of incomplete tests for complex prompts together with mitigation strategies. revision: yes
Circularity Check
No circularity in CADTestBench derivation or test-guided generation claims
full rationale
The paper introduces CADTestBench as a novel test-based benchmark using executable CADTests to verify geometric/topological requirements implied by text prompts. Benchmarking of prior Text-to-CAD methods and the demonstration that CADTests can guide generation to produce superior simple baselines are presented as independent empirical applications of the new tests. No load-bearing steps reduce by construction to self-definition, fitted inputs renamed as predictions, or self-citation chains; the central claims rest on the external validity of the proposed tests rather than tautological equivalence to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Executable software tests can verify geometric and topological properties of CAD models against text prompt requirements
invented entities (2)
-
CADTests
no independent evidence
-
CADTestBench
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CADTESTS are executable software tests verifying whether a generated CAD model meets the design specifications of the input prompt... implemented as a Python code snippet executed on the B-rep m... using selectors along with B-Rep inspection primitives, including topology counts, bounding box dimensions, areas and volumes...
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We leverage mutation analysis both to measure the effectiveness of generated CADTESTS and to guide the design of more discriminative tests... mutation score is defined as the fraction of killed mutants
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Generating cad code with vision-language models for 3d designs.ICLR, 2025
Kamel Alrashedy, Pradyumna Tambwekar, Zulfiqar Zaidi, Megan Langwasser, Wei Xu, and Matthew Gombolay. Generating cad code with vision-language models for 3d designs.ICLR, 2025
work page 2025
-
[2]
Anthropic. The claude model spec. 2025. URL https://docs.anthropic.com/en/docs/ about-claude/claude-model-spec
work page 2025
-
[3]
Testing telecoms software with quviq quickcheck
Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing telecoms software with quviq quickcheck. InProceedings of the 2006 ACM SIGPLAN Workshop on Erlang, 2006
work page 2006
-
[4]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.ArXiv, 2021
work page 2021
-
[5]
Autodesk Inc. Fusion 360. https://www.autodesk.com/products/fusion-360, 2026. Integrated CAD, CAM, and CAE platform
work page 2026
-
[6]
Query2cad: Generating cad models using natural language queries.ArXiv, 2024
Akshay Badagabettu, Sai Sravan Yarlagadda, and Amir Barati Farimani. Query2cad: Generating cad models using natural language queries.ArXiv, 2024
work page 2024
- [7]
-
[8]
Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-V oss, William H
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phi...
work page 2021
- [9]
-
[10]
Dassault Systèmes. Solidworks. https://www.solidworks.com, 2026. Professional CAD software for solid modeling and mechanical design
work page 2026
-
[11]
Richard A DeMillo, Richard J Lipton, and Frederick G Sayward. Program mutation: A new approach to program testing.Infotech State of the Art Report, Software Testing, 1979
work page 1979
-
[12]
Transcad: A hierarchical transformer for cad sequence inference from point clouds
Elona Dupont, Kseniya Cherenkova, Dimitrios Mallis, Gleb Gusev, Anis Kacem, and Djamila Aouada. Transcad: A hierarchical transformer for cad sequence inference from point clouds. In ECCV, 2024
work page 2024
-
[13]
Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward.ArXiv, 2025
Yandong Guan, Xilin Wang, Xingxi Ming, Jing Zhang, Dong Xu, and Qian Yu. Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward.ArXiv, 2025
work page 2025
-
[14]
Measuring Coding Challenge Competence With APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Xiaodong Song, and Jacob Steinhardt. Measur- ing coding challenge competence with apps.ArXiv, abs/2105.09938, 2021
work page internal anchor Pith review arXiv 2021
-
[15]
Pierce, Thomas Arts, and Ulf Norell
John Hughes, Benjamin C. Pierce, Thomas Arts, and Ulf Norell. Mysteries of dropbox: Property- based testing of a distributed synchronization service. In2016 IEEE International Conference on Software Testing, Verification and Validation (ICST), 2016. 10
work page 2016
-
[16]
An analysis and survey of the development of mutation testing
Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 2011
work page 2011
-
[17]
Cad-llama: Leveraging large language models for computer-aided design parametric 3d model generation
Li Jiahao, Ma Weijian, Li Xueyang, Lou Yunzhong, Zhou Guichun, and Zhou Xiangdong. Cad-llama: Leveraging large language models for computer-aided design parametric 3d model generation. InCVPR, 2025
work page 2025
-
[18]
Davinci: A single-stage architecture for constrained cad sketch inference
Ahmet Serdar Karadeniz, Dimitrios Mallis, Nesryne Mejri, Kseniya Cherenkova, Anis Kacem, and Djamila Aouada. Davinci: A single-stage architecture for constrained cad sketch inference. InBMVC, 2024
work page 2024
-
[19]
Ahmet Serdar Karadeniz, Dimitrios Mallis, Nesryne Mejri, Kseniya Cherenkova, Anis Kacem, and Djamila Aouada. Picasso: A feed-forward framework for parametric inference of cad sketches via rendering self-supervision. InWACV, 2025
work page 2025
-
[20]
Micadangelo: Fine-grained reconstruction of constrained cad models from 3d scans.NeurIPS, 2025
Ahmet Serdar Karadeniz, Dimitrios Mallis, Danila Rukhovich, Kseniya Cherenkova, Anis Kacem, and Djamila Aouada. Micadangelo: Fine-grained reconstruction of constrained cad models from 3d scans.NeurIPS, 2025
work page 2025
-
[21]
Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts.NeurIPS, 2024
Mohammad Sadil Khan, Sankalp Sinha, Talha Uddin, Didier Stricker, Sk Aziz Ali, and Muham- mad Zeshan Afzal. Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts.NeurIPS, 2024
work page 2024
-
[22]
cadrille: Multi-modal cad reconstruction with online reinforcement learning.ArXiv, 2025
Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna V orontsova, Anton Konushin, Vladislav Kurenkov, and Danila Rukhovich. cadrille: Multi-modal cad reconstruction with online reinforcement learning.ArXiv, 2025
work page 2025
-
[23]
Spoc: Search-based pseudocode to code.NIPS, 2019
Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alexander Aiken, and Percy Liang. Spoc: Search-based pseudocode to code.NIPS, 2019
work page 2019
-
[24]
Unsuper- vised translation of programming languages.NIPS, 2020
Marie-Anne Lachaux, Baptiste Rozière, Lowik Chanussot, and Guillaume Lample. Unsuper- vised translation of programming languages.NIPS, 2020
work page 2020
-
[25]
J. Lambourne, Karl D. D. Willis, Pradeep Kumar Jayaraman, Aditya Sanghi, Peter Meltzer, and Hooman Shayani. Brepnet: A topological message passing system for solid models.2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
work page 2021
-
[26]
Xueyang Li, Yu Song, Yunzhong Lou, and Xiangdong Zhou. Cad translator: An effective drive for text to 3d parametric computer-aided design generative modeling. InACM Multimedia, 2024
work page 2024
-
[27]
Competition-level code generation with alphacode.Science, 2022
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom, Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de, Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey, Cherepanov, James Molloy, Daniel Jaymin Mankowitz, Esme Suther- land Robs...
work page 2022
-
[28]
MacIver, Zac Hatfield-Dodds, and Many Other Contributors
David R. MacIver, Zac Hatfield-Dodds, and Many Other Contributors. Hypothesis: A new approach to property-based testing.Journal of Open Source Software, 4(43):1891, 2019. doi: 10.21105/joss.01891. URLhttps://doi.org/10.21105/joss.01891
-
[29]
Sharp challenge 2023: Solving cad history and parameters recovery from point clouds and 3d scans
Dimitrios Mallis, Ali Sk Aziz, Elona Dupont, Kseniya Cherenkova, Ahmet Serdar Karadeniz, Mohammad Sadil Khan, Anis Kacem, Gleb Gusev, and Djamila Aouada. Sharp challenge 2023: Solving cad history and parameters recovery from point clouds and 3d scans. overview, datasets, metrics, and baselines. InCVPRW, 2023
work page 2023
-
[30]
Cad-assistant: Tool- augmented vllms as generic cad task solvers.ICCV, 2025
Dimitrios Mallis, Ahmet Serdar Karadeniz, Sebastian Cavada, Danila Rukhovich, Niki Foteinopoulou, Kseniya Cherenkova, Anis Kacem, and Djamila Aouada. Cad-assistant: Tool- augmented vllms as generic cad task solvers.ICCV, 2025
work page 2025
-
[31]
From idea to cad: A language model-driven multi-agent system for collaborative design
Felix Ocker, Stefan Menzel, Ahmed Sadik, and Thiago Rios. From idea to cad: A language model-driven multi-agent system for collaborative design. InArxiv, 2025. 11
work page 2025
-
[32]
OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Zhang, Yue Jia, Yves Le Traon, and Mark Harman
Mike Papadakis, Marinos Kintis, Jie M. Zhang, Yue Jia, Yves Le Traon, and Mark Harman. Chapter six - mutation testing advances: An analysis and survey.Adv. Comput., 2019
work page 2019
-
[34]
Cad-recode: Reverse engineering cad code from point clouds.ICCV, 2025
Danila Rukhovich, Elona Dupont, Dimitrios Mallis, Kseniya Cherenkova, Anis Kacem, and Djamila Aouada. Cad-recode: Reverse engineering cad code from point clouds.ICCV, 2025
work page 2025
-
[35]
Vitruvion: A generative model of parametric cad sketches
Ari Seff, Wenda Zhou, Nick Richardson, and Ryan P Adams. Vitruvion: A generative model of parametric cad sketches. InICLR, 2022
work page 2022
-
[36]
Lambourne, Tolga Birdal, and Leonidas J
Mikaela Angelina Uy, Yen-Yu Chang, Minhyuk Sung, Purvi Goel, J. Lambourne, Tolga Birdal, and Leonidas J. Guibas. Point2cyl: Reverse engineering 3d objects from point clouds to extrusion cylinders.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
work page 2022
-
[37]
Can large language models write good property-based tests?ArXiv, 2024
Vasudev Vikram, Caroline Lemieux, and Rohan Padhye. Can large language models write good property-based tests?ArXiv, 2024
work page 2024
-
[38]
Text-to-cad generation through infusing visual feedback in large language models.ICLR, 2025
Ruiyu Wang, Yu Yuan, Shizhao Sun, and Jiang Bian. Text-to-cad generation through infusing visual feedback in large language models.ICLR, 2025
work page 2025
-
[39]
Deepcad: A deep generative network for computer- aided design models
Rundi Wu, Chang Xiao, and Changxi Zheng. Deepcad: A deep generative network for computer- aided design models. InCVPR, 2021
work page 2021
-
[40]
Guibas, Dahua Lin, and Gordon Wetzstein
Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas J. Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v(ision) is a human-aligned evaluator for text-to-3d generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[41]
React: Synergizing reasoning and acting in language models.ArXiv, 2022
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.ArXiv, 2022
work page 2022
-
[42]
Text2cad: Text to 3d cad generation via technical drawings.ArXiv, abs/2411.06206, 2024
Mohsen Yavartanoo, Sangmin Hong, Reyhaneh Neshatavar, and Kyoung Mu Lee. Text2cad: Text to 3d cad generation via technical drawings.ArXiv, abs/2411.06206, 2024
-
[43]
Yu Yuan, Shizhao Sun, Qi Liu, and Jiang Bian. Cad-editor: A locate-then-infill framework with automated training data synthesis for text-based cad editing.ICML, 2025
work page 2025
-
[44]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...
work page 2023
-
[45]
V e r i f i e s t h e s h a p e h a s e x a c t l y o n e s h e l l
Zhanwei Zhang, Shizhao Sun, Wenxiao Wang, Deng Cai, and Jiang Bian. Flexcad: Unified and versatile controllable cad generation with fine-tuned large language models. InICLR, 2025. 12 Text-to-CAD Evaluation with CADTESTS Supplementary Material This supplementary material includes additional details and illustrations that were not included on the main paper...
work page 2025
-
[46]
In total, 1,275cad mutants were generated. More examples of mutated CAD BReps are shown in Fig. 12. import cadquery as cq def create_cad () -> cq.Workplane: cylinder_diameter:float = 1.31858 cylinder_height:float = 0.468766 cylinder:cq.Workplane = ( cq.Workplane ( "XY" ) .cylinder (cylinder_height, cylinder_diameter /2) ) board_length:float = 1.5 board_wi...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.