
arxiv: 2606.01057 · v1 · pith:NUQJYLWBnew · submitted 2026-05-31 · 💻 cs.CV · cs.AI · cs.GR · cs.LG

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Pith reviewed 2026-06-28 17:08 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.LG
keywords 3DCodeBench · procedural 3D modeling · vision-language models · code generation · benchmark · 3DCodeArena · agentic procedural modeling

The pith

Vision-language models fail to generate procedural 3D modeling code primarily due to API mismatches and disconnected geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates 3DCodeBench to measure how well vision-language models translate text and image prompts into code that builds 3D objects inside modeling software. It pairs this with 3DCodeArena, where humans rank the quality of the resulting 3D shapes through pairwise comparisons. Tests on twelve advanced models find that most errors happen when the generated code calls the wrong software functions. Even code that runs without errors tends to create 3D parts that do not connect properly or float apart. Giving the models more time to reason or letting them revise their code over several turns raises success rates. The results point to the need for better data on procedural code and for execution environments that give detailed feedback during refinement.
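The multi-turn refinement described above can be pictured as a generate, execute, feed-back loop. The sketch below is an illustration only, not the paper's harness: `FakeModelingAPI`, `refine`, and the `generate(prompt, feedback)` interface are all hypothetical names, and the stand-in API replaces whatever scripting module the real 3D software exposes.

```python
import traceback
from typing import Callable, List, Optional, Tuple


class FakeModelingAPI:
    """Hypothetical stand-in for a 3D software scripting API."""

    def __init__(self) -> None:
        self.objects: List[Tuple[str, float]] = []

    def add_cube(self, size: float) -> None:
        self.objects.append(("cube", size))


def refine(generate: Callable[[str, Optional[str]], str],
           prompt: str,
           max_turns: int = 3):
    """Loop generate -> execute -> feedback until the code runs.

    `generate(prompt, feedback)` is an assumed interface: it returns source
    code given the task prompt and the previous turn's error text (or None).
    Returns (turn_number, objects) on success, or (None, []) on failure.
    """
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = generate(prompt, feedback)
        api = FakeModelingAPI()
        try:
            # Illustration only: exec is not a sandbox.
            exec(code, {"api": api})
            return turn, api.objects
        except Exception:
            # The last traceback line names the failing call, which is the
            # signal a model needs to correct an API mismatch.
            feedback = traceback.format_exc().strip().splitlines()[-1]
    return None, []
```

A generator that first calls a nonexistent `api.add_box` and then, on seeing the `AttributeError`, switches to `api.add_cube` would succeed on the second turn. How much detail the execution environment returns is the design choice the paper's closing recommendation is about.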

Core claim

3DCodeBench shows that, for vision-language models acting as procedural 3D modelers, API mismatches are the main source of failure; that code which executes successfully still frequently yields 3D models with disconnected or floating components; and that test-time scaling through larger thinking budgets and multi-turn refinement improves performance.
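One rough way to flag disconnected components is to count the connected pieces of a generated mesh. The sketch below is an assumption for illustration, not a metric taken from the paper: `count_components` is a hypothetical helper that treats a mesh as a vertex count plus a list of faces (tuples of vertex indices) and merges vertices that share a face using union-find.

```python
from typing import Sequence, Tuple


def count_components(num_vertices: int,
                     faces: Sequence[Tuple[int, ...]]) -> int:
    """Count connected pieces of a mesh via union-find over shared vertices.

    Vertices that no face references are ignored.
    """
    parent = list(range(num_vertices))

    def find(v: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    used = set()
    for face in faces:
        used.update(face)
        root = find(face[0])
        for v in face[1:]:
            parent[find(v)] = root
    return len({find(v) for v in used})
```

A count above one only shows that parts share no vertices. Procedural models are often assembled from separate primitives that overlap in space, so a real floating-part check would also need a spatial proximity test.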

What carries the argument

3DCodeBench, a benchmark dataset of (text/image prompt, procedural code, 3D object) triplets, together with 3DCodeArena, the human-preference platform for assessing VLM-generated 3D models.

Load-bearing premise

The curated large-scale dataset of (multimodal prompt, procedural code, 3D object) triplets is representative of real procedural 3D modeling tasks.

What would settle it

Running the best VLM agents from the benchmark in a live 3D modeling session and checking whether the generated code produces usable assets faster, or with less expert intervention, than manual coding.

Original abstract

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.
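The abstract does not say how 3DCodeArena turns pairwise preferences into a ranking. As an illustration only, the sketch below uses sequential Elo updates, a common choice for arena-style leaderboards; the rating model, the function name `elo_ratings`, and the parameters `k` and `base` are assumptions rather than details from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def elo_ratings(votes: Iterable[Tuple[str, str]],
                k: float = 32.0,
                base: float = 1000.0) -> Dict[str, float]:
    """Compute Elo ratings from (winner, loser) votes, applied in order."""
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win under current ratings.
        gap = (ratings[loser] - ratings[winner]) / 400.0
        expected_win = 1.0 / (1.0 + 10.0 ** gap)
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)
```

Elo depends on vote order, so a platform that wants order-independent scores might instead fit a Bradley-Terry model over all votes at once.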

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces 3DCodeBench, a benchmark consisting of a large-scale curated dataset of multimodal (text/image) prompts paired with procedural code and 3D object ground truth, together with an evaluation protocol for VLMs acting as agents to generate executable 3D modeling code. It also presents 3DCodeArena, a human-preference ranking platform. Evaluations across 12 VLMs yield two main observations: failures are dominated by API mismatches while even successful renders exhibit disconnected or floating geometry, and test-time scaling (higher thinking budgets, multi-turn refinement) improves results. The dataset, protocol, and arena are released publicly.

Significance. If the benchmark's construction and evaluation protocol are shown to be representative, the work supplies a concrete failure taxonomy and scaling evidence that can guide VLM development for procedural 3D tasks. The public release of the multimodal prompt-code-3D triplet dataset, evaluation harness, and 3DCodeArena platform constitutes a reusable resource for the community.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (dataset construction): the central observations on failure modes and test-time scaling rest on a 'curated large-scale dataset' whose prompt selection criteria, API coverage statistics, diversity quantification, and expert validation against real production logs are not described; without these the representativeness claim cannot be assessed.
  2. [§4] §4 (evaluation protocol): no details are supplied on statistical significance testing of the reported improvements, controls for prompt difficulty, or inter-prompt variance; this makes it impossible to determine whether the scaling benefits are robust or confounded by the particular prompt distribution.
  3. [§4.1] §4.1 (failure taxonomy): the assertion that 'failures mostly arise from API mismatches' is presented without quantitative breakdown (e.g., percentage of errors by category or per-API error rates), rendering the taxonomy non-reproducible and its implications for VLM training data unclear.
minor comments (2)
  1. [Introduction] The title uses 'Agentic' without a precise definition in the introduction; a short clarification of the agent loop (perception-action-execution) would help.
  2. [Figure 3] Figure captions for the 3DCodeArena interface should explicitly state the number of pairwise comparisons and the qualification criteria for raters.
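
For major comment 2, one simple check of whether a scaling gain is robust is a paired test over per-prompt scores. The sketch below uses a sign-flip permutation test, chosen here only because it needs nothing beyond the standard library; it is not a procedure taken from the paper, and `paired_permutation_test` and its parameters are hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_test(baseline: Sequence[float],
                            treated: Sequence[float],
                            n_resamples: int = 10_000,
                            seed: int = 0) -> float:
    """Two-sided p-value for the mean per-prompt difference being zero.

    Under the null hypothesis each paired difference is equally likely to
    be positive or negative, so its sign is flipped at random.
    """
    if len(baseline) != len(treated):
        raise ValueError("scores must be paired per prompt")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Pairing by prompt removes prompt difficulty from the comparison, which addresses part of the confounding concern. Stratifying by difficulty would still require a per-prompt difficulty label.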

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments below, indicating the revisions we plan to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (dataset construction): the central observations on failure modes and test-time scaling rest on a 'curated large-scale dataset' whose prompt selection criteria, API coverage statistics, diversity quantification, and expert validation against real production logs are not described; without these the representativeness claim cannot be assessed.

    Authors: We agree that additional details on the dataset construction are necessary to substantiate the representativeness of the benchmark. In the revised manuscript, we will expand §3 with a dedicated subsection describing the prompt selection criteria, API coverage statistics (including a table of API usage frequencies), quantitative diversity metrics (e.g., coverage of object categories and complexity levels), and the expert validation process against real production logs from 3D modeling workflows. revision: yes

  2. Referee: [§4] §4 (evaluation protocol): no details are supplied on statistical significance testing of the reported improvements, controls for prompt difficulty, or inter-prompt variance; this makes it impossible to determine whether the scaling benefits are robust or confounded by the particular prompt distribution.

    Authors: We acknowledge the lack of statistical analysis in the evaluation protocol. The revised version of §4 will include statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values), controls for prompt difficulty through stratified sampling or difficulty scoring, and reporting of inter-prompt variance (standard deviations across prompts). We will also add error bars to the figures. revision: yes

  3. Referee: [§4.1] §4.1 (failure taxonomy): the assertion that 'failures mostly arise from API mismatches' is presented without quantitative breakdown (e.g., percentage of errors by category or per-API error rates), rendering the taxonomy non-reproducible and its implications for VLM training data unclear.

    Authors: The original manuscript presented the failure taxonomy based on qualitative analysis of errors. We will revise §4.1 to include a quantitative breakdown, such as a table showing the percentage of errors by category (API mismatch, geometric issues, etc.) and per-API error rates, derived from our manual categorization of a sample of failures. This will make the taxonomy more reproducible and clarify implications for training data. revision: yes
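
The quantitative breakdown promised in response 3 amounts to tallying manually assigned failure labels. A minimal sketch, assuming each failed run carries a single category string; `failure_breakdown` is a hypothetical helper, and any labels passed to it are placeholders rather than categories or results reported by the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def failure_breakdown(labels: Iterable[str]) -> Dict[str, float]:
    """Map each failure category to its percentage share of all failures.

    Categories are returned from most to least frequent.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.most_common()}
```

Keying the same tally on the name of the failing API call instead of the category would give the per-API error rates the referee asks for.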

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivation chain or self-referential predictions

full rationale

The paper introduces 3DCodeBench as a new evaluation dataset for VLMs on procedural 3D code generation and reports direct empirical observations on failure modes (API mismatches, disconnected geometry) and test-time scaling benefits. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. Claims rest on released benchmark results rather than any reduction to self-citations or ansatzes. This is a standard non-circular empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark rather than a mathematical derivation; no free parameters, domain axioms, or invented entities are required to support the central claim.

pith-pipeline@v0.9.1-grok · 5873 in / 1339 out tokens · 24116 ms · 2026-06-28T17:08:46.792934+00:00 · methodology

