MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Xiao-Ming Wu; Xiaoyu Dong; Zhi Li

arxiv: 2605.28579 · v2 · pith:3QMOJ2UBnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 11:50 UTC · model grok-4.3

keywords: text-to-cad · benchmark · large language models · manufacturability · functionality · assemblability · B-Rep assemblies · VLM evaluation

The pith

MUSE shows that LLMs can generate executable CAD code and valid geometry but rarely satisfy the criteria for functional, manufacturable, and assemblable designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing Text-to-CAD benchmarks rely on single-part models and geometric similarity scores that ignore real engineering needs. It introduces MUSE as a new benchmark using complex editable B-Rep assemblies paired with design specifications. Evaluation proceeds through code executability, geometric validity, and a final rubric stage that checks functionality, manufacturability, and assemblability via a VLM judge validated against human ratings. Experiments across models reveal a consistent drop in success at each stage, with the weakest performance on the engineering criteria. A reader would care because this gap explains why current text-driven generation has not yet reached industrial product design.

Core claim

The paper claims that Text-to-CAD must be judged by whether generated models satisfy practical design intent through a three-stage protocol of code check, geometric check, and rubric-based alignment on manufacturability, functionality, and assemblability; experiments demonstrate a failure cascade in which even strong LLMs achieve only limited success on the final engineering criteria.

What carries the argument

Three-stage evaluation protocol that ends with design-specific rubrics scored by a VLM judge to measure alignment with functionality, manufacturability, and assemblability.
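The gating logic of such a three-stage protocol can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `StageResult`, `evaluate`, and `cascade_rates`, and the toy sample fields `runs`, `valid`, and `meets_rubric`, are all hypothetical, and the three checks are stand-ins for a real code executor, geometry validator, and VLM rubric judge.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Check = Callable[[Any], bool]


@dataclass(frozen=True)
class StageResult:
    code_ok: bool
    geometry_ok: bool
    rubric_ok: bool


def evaluate(sample: Any, code_check: Check, geometry_check: Check,
             rubric_check: Check) -> StageResult:
    """Run the three stages in order; a failed stage blocks all later ones."""
    code_ok = bool(code_check(sample))
    geometry_ok = code_ok and bool(geometry_check(sample))
    rubric_ok = geometry_ok and bool(rubric_check(sample))
    return StageResult(code_ok, geometry_ok, rubric_ok)


def cascade_rates(results: List[StageResult]) -> Dict[str, float]:
    """Fraction of all samples that survive up to and including each stage."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "code": sum(r.code_ok for r in results) / n,
        "geometry": sum(r.geometry_ok for r in results) / n,
        "rubric": sum(r.rubric_ok for r in results) / n,
    }


# Toy data (hypothetical): each dict stands in for one generated CAD program.
toy = [
    {"runs": True, "valid": True, "meets_rubric": True},
    {"runs": True, "valid": True, "meets_rubric": False},
    {"runs": True, "valid": False, "meets_rubric": True},
    {"runs": False, "valid": True, "meets_rubric": True},
]
results = [
    evaluate(s, lambda s: s["runs"], lambda s: s["valid"],
             lambda s: s["meets_rubric"])
    for s in toy
]
print(cascade_rates(results))
```

Because each stage is gated on the previous one, the three rates can only stay level or fall from code to geometry to rubric, which is the shape of the failure cascade described above.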

If this is right

Text-to-CAD systems must incorporate engineering constraints during generation rather than relying on post-hoc geometric fixes.
Benchmarks should shift from single-part shape matching to multi-part assemblies with explicit design specifications.
Progress metrics should track success rates on fine-grained criteria such as assemblability instead of overall geometric similarity.
Evaluation frameworks need scalable judges that can be trusted on domain-specific rubrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines for CAD LLMs may need explicit feedback loops that simulate manufacturing and assembly checks.
The benchmark could be adapted to test specific manufacturing methods such as injection molding or CNC machining constraints.
Future work might explore whether adding simulation-based rewards during generation closes the observed failure cascade.

Load-bearing premise

The rubric-based VLM judge gives assessments of functionality, manufacturability, and assemblability that match human judgments.

What would settle it

A follow-up study in which human experts score a representative sample of generated models on the same rubrics and obtain pass rates substantially different from those of the VLM judge.

Figures

Figures reproduced from arXiv: 2605.28579 by Xiao-Ming Wu, Xiaoyu Dong, Zhi Li.

**Figure 2.** A unified assembly graph can correspond to different designs under different parameter …

**Figure 3.** Overview of our dataset construction. (a) Each benchmark instance provides a Design …

**Figure 4.** Illustration of the proposed evaluation system and rubric generation process. Given a …

**Figure 5.** Dataset statistics of MUSE, showing the distributions of (a) manufacturing methods, (b) materials, and (c) connection methods across all design instances.

§4.1 Data Distribution of MUSE: MUSE comprises 106 design instances spanning a wide range of manufacturing processes, materials, and connection methods.

Original abstract

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MUSE brings a more engineering-focused benchmark to Text-to-CAD but its main findings rest on a VLM judge whose validation details are still thin.

The core contribution is a benchmark built around complex B-Rep assemblies rather than single parts, paired with design specifications and evaluated in three stages: code execution, geometric validity, then rubric-based checks for functionality, manufacturability, and assemblability. The use of a VLM judge to score the rubrics at scale is a practical choice, and the paper reports that human annotation was used to check its reliability. Experiments across closed and open LLMs show the expected progressive drop in success, with even strong models weak on the detailed engineering criteria.

This setup moves evaluation past geometric similarity metrics, which is a clear improvement for anyone who cares about downstream use. The dataset, leaderboard, and protocol give the community something concrete to test against.

The soft spot is the VLM judge. The stress-test note correctly identifies this as load-bearing. The abstract states that human validation was performed, but supplies no agreement numbers, sample sizes, or discussion of edge cases such as tolerance issues or kinematic constraints. Without those specifics it is difficult to know whether the reported failure rates reflect model behavior or judge noise. If the full paper contains only high-level validation, that section needs tightening.

The work is aimed at researchers building or benchmarking LLM-driven CAD systems. Readers who want a test that rewards practical design quality will find it useful. It is coherent enough and addresses a real gap, so it deserves referee time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces MUSE, a benchmark for text-to-CAD generation targeting complex, editable B-Rep assemblies. It pairs design instances with structured specifications and evaluates outputs via a three-stage protocol (code check, geometric check, design-intent alignment) that uses rubric-based VLM scoring for functionality, manufacturability, and assemblability. The VLM judge is validated by human annotation. Experiments on closed- and open-source LLMs demonstrate a failure cascade, with even strong models showing limited success on fine-grained engineering criteria. The work releases dataset, code, and leaderboard.

Significance. If the VLM validation and failure-cascade results hold, MUSE supplies a needed shift from geometric similarity metrics to practical engineering criteria, which could steer Text-to-CAD research toward industrially relevant outputs. The release of the dataset, code, and leaderboard is a concrete strength that supports reproducibility and community follow-up.

major comments (2)

[Evaluation Protocol / VLM Judge Validation] The design-intent alignment stage (final stage of the three-stage protocol) relies on rubric-based VLM scoring of functionality, manufacturability, and assemblability; the manuscript asserts this judge is validated by human annotation, yet provides no quantitative details on sample size, inter-rater agreement (e.g., Cohen’s kappa or percentage agreement), or coverage of edge cases such as tolerance stack-up and assembly kinematics. Because the reported “limited success on fine-grained engineering criteria” and the failure-cascade conclusion rest directly on these scores, the validation evidence must be expanded to confirm the judge does not introduce systematic bias.
[Experiments] The abstract and high-level description state that experiments reveal a clear failure cascade, but the provided text supplies no quantitative tables or per-stage success rates (e.g., percentage of models passing code check vs. geometric check vs. design-intent alignment). Without these numbers and error bars, it is impossible to verify the magnitude or statistical significance of the cascade or to compare closed- versus open-source models on the engineering criteria.
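The agreement statistics requested in the first major comment are simple to compute once paired labels exist. The sketch below is illustrative only: the function names and the ten paired pass/fail labels are hypothetical, not data from the paper, and it treats each rubric item as a categorical label scored once by a human and once by the VLM judge.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of items on which the two raters gave the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical pass (1) / fail (0) labels on ten rubric items.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
vlm = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(f"agreement = {percent_agreement(human, vlm):.2f}")
print(f"kappa     = {cohens_kappa(human, vlm):.2f}")
```

In this toy case the raters agree on 8 of 10 items while chance agreement is 0.5, so kappa comes out near 0.6. Reporting both numbers matters because raw agreement can look high when most items share one label.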

minor comments (2)

[Benchmark Construction] The abstract refers to “practical design instances” and “structured Design Specifications” without defining the source or construction process of the benchmark instances; a short paragraph or table in §3 would clarify dataset provenance.
Figure captions and the project website URL are mentioned but not cross-referenced in the text; ensure every figure is cited at the point of first discussion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our work. We address each major comment below and have revised the manuscript to incorporate the requested details.

Referee: The design-intent alignment stage (final stage of the three-stage protocol) relies on rubric-based VLM scoring of functionality, manufacturability, and assemblability; the manuscript asserts this judge is validated by human annotation, yet provides no quantitative details on sample size, inter-rater agreement (e.g., Cohen’s kappa or percentage agreement), or coverage of edge cases such as tolerance stack-up and assembly kinematics. Because the reported “limited success on fine-grained engineering criteria” and the failure-cascade conclusion rest directly on these scores, the validation evidence must be expanded to confirm the judge does not introduce systematic bias.

Authors: We agree that the validation of the VLM judge requires quantitative support. In the revised manuscript we expand the validation subsection to report the human annotation sample size, inter-rater agreement statistics (Cohen’s kappa and percentage agreement), and explicit coverage of edge cases including tolerance stack-up and assembly kinematics. These additions demonstrate that the VLM scores align with human judgments and do not introduce systematic bias.

Revision: yes

Referee: The abstract and high-level description state that experiments reveal a clear failure cascade, but the provided text supplies no quantitative tables or per-stage success rates (e.g., percentage of models passing code check vs. geometric check vs. design-intent alignment). Without these numbers and error bars, it is impossible to verify the magnitude or statistical significance of the cascade or to compare closed- versus open-source models on the engineering criteria.

Authors: We concur that per-stage quantitative results are essential for verifying the failure cascade. The revised manuscript now includes dedicated tables reporting success rates at each protocol stage (code check, geometric check, design-intent alignment) for all models, together with error bars. These tables enable direct assessment of the cascade magnitude, statistical significance, and closed- versus open-source model differences on the engineering criteria.

Revision: yes
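One common way to attach error bars to a per-stage success rate is a Wilson score interval, which behaves better than the normal approximation when the benchmark is small or the rate is close to 0 or 1. The sketch below is an assumption about how such bars could be computed, not the method used in the paper; `wilson_interval` is a hypothetical helper, and the pass counts are invented. The total of 106 matches the number of design instances given in the Figure 5 text.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 gives about 95%)."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need total > 0 and 0 <= passes <= total")
    p = passes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical pass counts out of 106 design instances for one model.
for stage, passes in [("code check", 90), ("geometric check", 60), ("design-intent", 12)]:
    low, high = wilson_interval(passes, 106)
    print(f"{stage:16s} {passes / 106:.3f}  [{low:.3f}, {high:.3f}]")
```

With only 106 instances the intervals are wide, so small differences between models at the final stage may not be distinguishable without repeated runs or a larger sample.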

Circularity Check

0 steps flagged

No circularity: benchmark and evaluation protocol are self-contained contributions

The paper introduces MUSE as a new benchmark with a three-stage evaluation protocol (code check, geometric check, design-intent alignment via rubrics) and reports empirical results on LLMs. No equations, fitted parameters, or derived predictions exist that could reduce to inputs by construction. The VLM judge is presented as an external tool whose reliability is checked via separate human annotation, which is an independent empirical step rather than a self-referential loop. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This matches the default case of a benchmark paper whose central claims rest on new data and protocol rather than internal re-derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the benchmark itself is the primary contribution.

pith-pipeline@v0.9.1-grok · 5775 in / 1016 out tokens · 35408 ms · 2026-06-29T11:50:57.210447+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning
cs.CV 2026-06 unverdicted novelty 7.0

P3D-Bench is a benchmark with three task families that scores MLLMs on generating executable parametric 3D programs, finding failures in precise geometry and part assembly.

[1] [1]

Text2CAD: generating sequential cad designs from beginner-to-expert level text prompts.Advances in Neural Information Processing Systems, 37:7552–7579, 2024

Mohammad S Khan, Sankalp Sinha, Talha U Sheikh, Didier Stricker, Sk A Ali, and Muham- mad Z Afzal. Text2CAD: generating sequential cad designs from beginner-to-expert level text prompts.Advances in Neural Information Processing Systems, 37:7552–7579, 2024

2024

[2] [2]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: fast 3d object reconstruction from a single image.arXiv preprint arXiv:2403.02151, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

SeparateGen: semantic component-based 3D character generation from single images.IEEE Transactions on Visualization and Computer Graphics, 2026

Dong-Yang Li, Yi-Long Liu, Zi-Xian Liu, Yan-Pei Cao, Meng-Hao Guo, and Shi-Min Hu. SeparateGen: semantic component-based 3D character generation from single images.IEEE Transactions on Visualization and Computer Graphics, 2026

2026

[4] [4]

Text-to-CAD generation through infusing visual feedback in large language models

Ruiyu Wang, Yu Yuan, Shizhao Sun, and Jiang Bian. Text-to-CAD generation through infusing visual feedback in large language models. InProceedings of the International Conference on Machine Learning (ICML), 2025

2025

[5] [5]

Creating novel furniture through topology optimization and advanced manufacturing.Rapid Prototyping Journal, 27(9):1749–1758, 2021

Jiaming Ma, Zhi Li, Zi-Long Zhao, and Yi Min Xie. Creating novel furniture through topology optimization and advanced manufacturing.Rapid Prototyping Journal, 27(9):1749–1758, 2021

2021

[6] [6]

DeepCAD: A deep generative network for computer-aided design models

Rundi Wu, Chang Xiao, and Changxi Zheng. DeepCAD: A deep generative network for computer-aided design models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6772–6782, October 2021

2021

[7] [7]

Text2CAD: generating sequential CAD models from beginner-to- expert level text prompts

Mohammad Sadil Khan, Sankalp Sinha, Talha Uddin Sheikh, Didier Stricker, Sk Aziz Ali, and Muhammad Zeshan Afzal. Text2CAD: generating sequential CAD models from beginner-to- expert level text prompts. InAdvances in Neural Information Processing Systems (NeurIPS), pages 7552–7579, 2024

2024

[8] [8]

CAD- GPT: synthesising CAD construction sequence with spatial reasoning-enhanced multimodal LLMs

Siyu Wang, Cailian Chen, Xinyi Le, Qimin Xu, Lei Xu, Yanzhou Zhang, and Jie Yang. CAD- GPT: synthesising CAD construction sequence with spatial reasoning-enhanced multimodal LLMs. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7880–7888, 2025

2025

[9] [9]

CAD translator: an effective drive for text to 3D parametric computer-aided design generative modeling

Xueyang Li, Yu Song, Yunzhong Lou, and Xiangdong Zhou. CAD translator: an effective drive for text to 3D parametric computer-aided design generative modeling. InProceedings of the ACM International Conference on Multimedia (ACM MM 2024), Poster, 2024

2024

[10] [10]

ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation

Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu, Jing Zhang, Dong Xu, and Qian Yu. ArtiCAD: articulated CAD assembly design via multi-agent code generation.arXiv preprint arXiv:2604.10992, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

MLLM-as-a-Judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. InInternational Conference on Machine Learning (ICML), 2024

2024

[12] [12]

MLLM-Bench: evaluating multimodal LLMs with per-sample criteria.arXiv preprint arXiv:2311.13951, 2024

Wentao Ge, Shunian Chen, Guiming Hardy Chen, et al. MLLM-Bench: evaluating multimodal LLMs with per-sample criteria.arXiv preprint arXiv:2311.13951, 2024. 10

work page arXiv 2024

[13] [13]

Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Miza- nur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, and Jimmy Huang. Judging the judges: Can large vision-language models fairly evaluate chart compre- hension and reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational ...

2025

[14] [14]

Prometheus-vision: Vision-language model as a judge for fine-grained evaluation

Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, 2024

2024

[15] Runzhou Liu, Hailey Weingord, Sejal Mittal, et al. Human-Aligned MLLM judges for fine-grained image editing evaluation: a benchmark, framework, and analysis. arXiv preprint arXiv:2602.13028, 2026.

[16] Yukang Feng, Jianwen Sun, Chuanhao Li, et al. A high-quality dataset and reliable evaluation for interleaved image-text generation. arXiv preprint arXiv:2506.09427, 2025.

[17] Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, and Jiaqi Wang. Genarena: How can we achieve human-aligned evaluation for visual generation tasks? arXiv preprint arXiv:2602.06013, 2026.

[18] Zhikai Li, Jiatong Li, Xuewen Liu, et al. K-Sort eval: efficient preference evaluation for visual generation via corrected VLM-as-a-Judge. In International Conference on Learning Representations (ICLR), 2026.

[19] Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaVA-Critic: Learning to evaluate multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13618–13628, 2025.

[20] Zeyu Chen, Huanjin Yao, Ziwang Zhao, and Min Yang. Advancing multimodal judge models through a capability-oriented benchmark and MCTS-driven data generation. arXiv preprint arXiv:2603.00546, 2026.

[21] Tianyi Xiong, Yi Ge, Ming Li, et al. Multi-Crit: Benchmarking multimodal judges on pluralistic criteria-following. arXiv preprint arXiv:2511.21662, 2025.

[22] Jesse Barkley, Rumi Loghmani, and Amir Barati Farimani. CADSmith: multi-agent CAD generation with programmatic geometric validation. arXiv preprint arXiv:2603.26512, 2026.

[24] Hao Ji, Kotha Aditya, Sebastian Escalante, and Yunjian Qiu. Codegen-3D: A benchmark for evaluating LLMs in zero-shot and iterative 3D modeling in Blender. IEEE Access, 2026.

[26] Tobias Preintner, Weixuan Yuan, Adrian König, Thomas Bäck, Elena Raponi, and Niki Van Stein. EvoCAD: evolutionary CAD code generation with vision language models. In 2025 IEEE 37th International Conference on Tools with Artificial Intelligence (ICTAI), pages 504–511. IEEE, 2025.

[27] Kamel Alrashedy, Pradyumna Tambwekar, Zulfiqar Zaidi, Megan Langwasser, Wei Xu, and Matthew Gombolay. Generating CAD code with vision-language models for 3D designs. arXiv preprint arXiv:2410.05340, 2024.

[28] Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao, Wen Liu, Wenrui Ding, Yi Ma, and Shenghua Gao. Pointer-CAD: unifying B-Rep and command sequences via pointer-based edges & faces selection. arXiv preprint arXiv:2603.04337, 2026.

[29] Ruiyu Wang, Yu Yuan, Shizhao Sun, and Jiang Bian. Text-to-CAD generation through infusing visual feedback in large language models. arXiv preprint arXiv:2501.19054, 2025.

[30] Jingwei Xu, Chenyu Wang, Zibo Zhao, Wen Liu, Yi Ma, and Shenghua Gao. CAD-MLLM: Unifying multimodality-conditioned CAD generation with MLLM. arXiv preprint arXiv:2411.04954, 2024.

[31] Jianxing Liao, Junyan Xu, Yatao Sun, Maowen Tang, Sicheng He, Jingxian Liao, Shui Yu, Yun Li, and Xiaohong Guan. Automated CAD modeling sequence generation from text descriptions via transformer-based large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21720–21748, 2025.

[32] Mohsen Yavartanoo, Sangmin Hong, Reyhaneh Neshatavar, and Kyoung Mu Lee. Text2CAD: text to 3D CAD generation via technical drawings. arXiv preprint arXiv:2411.06206, 2024.

[33] Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Sahana Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. In ICLR, 2024.

[34] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In EMNLP, 2023.

[35] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and chatbot arena. In NeurIPS Datasets and Benchmarks Track, 2023.

A Engineering Knowledge Tables for Manufacturability

To systematically eva...

- `<Task_Doc>`: the design specification, including design goals, component list, parameter ranges, and assembly graph
- `<Reference_Code>`: the ground-truth CAD logic and spatial coordinates
- `<Reference_SVG>`: the visual anchor / reference image.

# Output Template

Strictly follow the Markdown template below. You must use the exact terms `<Reference_SVG>` and `<Generated_SVG>` in the rubric. Write the instructions as if you are directly guiding the downstream judge. Do not include a separate "Core Focus" field. Instead, merge all necessary ins...