ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Jiaxin Ai; Tao Hu; Xuemeng Yang; Shu Zou; Hairong Zhang; Daocheng Fu; Yu Yang; Hongbin Zhou; Nianchen Deng; Pinlong Cai; Zhongyuan Wang; Botian Shi; Kaipeng Zhang; Licheng Wen

arxiv: 2606.13239 · v2 · pith:HTG33W7Knew · submitted 2026-06-11 · 💻 cs.SE · cs.AI · cs.CL · cs.CV


Pith reviewed 2026-07-01 07:43 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.CV

keywords: software agents, CAD software, program synthesis, Component Object Model, GUI agents, agent benchmarks, deterministic control


The pith

Treating Component Object Model calls as actions lets agents control industrial CAD software through deterministic program synthesis instead of visual clicks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that GUI-based agents fail at professional software due to fragile visual grounding and error buildup over long sequences, while API-based methods hit barriers from inconsistent protocols and closed interfaces. It proposes the COM-as-Action paradigm, which turns software manipulation into the synthesis of executable programs using the Component Object Model as a single reliable interface. This is tested through the new ComCADBench benchmark for real CAD tools, where a three-stage trained agent called ComActor reaches high success rates and holds up on extended tasks that defeat other approaches. The work also introduces a container-based training platform to scale the method.

Core claim

The central claim is that the Component Object Model supplies a unified executable abstraction for professional software, so interaction can be reframed as deterministic program synthesis. Under this COM-as-Action view, the ComActor agent, trained progressively across three stages on ComForge, attains state-of-the-art results on ComCADBench while showing resilience in long-horizon scenarios where GUI baselines fall to near-zero success and also transfers to an external CAD benchmark.

What carries the argument

The COM-as-Action paradigm, which converts professional software manipulation into deterministic program synthesis by issuing Component Object Model calls directly.
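To make the contrast with click-level control concrete, the sketch below treats one agent action as a whole program run against an application object. `FakeCadApp`, `new_part`, `add_box`, and `run_action` are hypothetical names invented for this illustration; they are not the paper's code and not any vendor's COM API. On Windows, the `app` object would instead be an automation object exposed by the CAD package.

```python
import traceback


class FakeCadApp:
    """Hypothetical stand-in for a CAD automation object (not a real COM API)."""

    def __init__(self):
        self.parts = {}

    def new_part(self, name):
        self.parts[name] = []
        return name

    def add_box(self, part, width, height, depth):
        if min(width, height, depth) <= 0:
            raise ValueError("box dimensions must be positive")
        self.parts[part].append(("box", width, height, depth))


def run_action(program, app):
    """Execute one synthesized program as a single action.

    Returns (ok, feedback). feedback is empty on success, or the traceback
    text on failure so that a caller can hand it back to the model.
    """
    try:
        exec(program, {"app": app})
        return True, ""
    except Exception:
        return False, traceback.format_exc()


# A synthesized "action": a short program instead of a sequence of clicks.
program = """
part = app.new_part("bracket")
app.add_box(part, 40.0, 10.0, 5.0)
"""

app = FakeCadApp()
ok, feedback = run_action(program, app)
```

The point of the design is that the outcome depends only on the program text and the application state, with no screenshot or coordinate in the loop, and that a failure surfaces as an exception with a readable traceback.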

If this is right

Frontier models reach near-zero success with GUI interaction but obtain immediate gains once switched to COM execution.
ComActor maintains performance across long sequences where baseline agents collapse.
The trained agent transfers to an external CAD benchmark beyond the training distribution.
A container platform enables large-scale data collection and training for this style of agent.
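A back-of-envelope calculation shows why long sequences are hard for step-by-step agents. The model below is an assumption made for illustration, not a result from the paper: each step succeeds independently with the same probability, so an episode of `steps` actions succeeds with that probability raised to the power `steps`.

```python
def episode_success(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent actions succeeds."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative numbers only: a 95% reliable step repeated many times.
for steps in (1, 10, 50, 200):
    print(steps, round(episode_success(0.95, steps), 4))
```

Under this simplified model, a 95% reliable step repeated 50 times completes the whole episode less than 8% of the time. Real agents do not fail independently at each step, so this is only a rough intuition for the collapse the paper reports.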

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same direct-interface approach could be tried on other Windows professional applications that expose COM, such as office or engineering suites.
Reducing reliance on visual grounding might lower cumulative error rates in any long-running agent workflow that currently uses screen observations.
Combining COM synthesis with selective API calls could create hybrid agents that handle both open and closed software environments.

Load-bearing premise

The Component Object Model must supply a single, accessible executable interface for real industrial CAD software that supports reliable program synthesis without needing visual interpretation or varied protocols.

What would settle it

Test whether success rates for COM-based agents remain high when applied to a commercial CAD package whose interfaces are not exposed through the Component Object Model.

Figures

Figures reproduced from arXiv: 2606.13239 by Botian Shi, Daocheng Fu, Hairong Zhang, Hongbin Zhou, Jiaxin Ai, Kaipeng Zhang, Licheng Wen, Nianchen Deng, Pinlong Cai, Shu Zou, Tao Hu, Xuemeng Yang, Yu Yang, Zhongyuan Wang.

**Figure 2.** Overview of our ComAct framework, consisting of three components: a data construction pipeline that …

**Figure 3.** ComCADBench covers 3 CAD platforms, 7 engineering activities, and supports long-horizon cross …

**Figure 4.** An execution trajectory of our agent completing a multi-task pipeline (modeling and engineering drawing).

**Figure 5.** Visualization of the ground truth artifacts for 3D modeling samples in ComCADBench.

**Figure 6.** Visualization of the ground truth artifacts for 2D sketching samples in ComCADBench.

**Figure 7.** Visualization of the ground truth artifacts for assembly samples in ComCADBench.

**Figure 8.** Detailed examples of input instructions across all specific task categories in ComCADBench.

Original abstract

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-basedapproaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmark.
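The abstract describes ComActor as self-correcting but gives no mechanism. One plausible reading, sketched here under stated assumptions, is an execute-and-retry loop in which a failed program's traceback is fed back to the model. `execute`, `self_correct`, and `toy_propose` are hypothetical names, the toy policy stands in for a language model, and the loop covers inference only, not the three-stage training.

```python
import traceback
from typing import Callable, List, Optional, Tuple


def execute(program: str, env: dict) -> Tuple[bool, str]:
    """Run a candidate program; return (ok, traceback text or empty string)."""
    try:
        exec(program, dict(env))  # shallow copy: names are fresh, objects are shared
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def self_correct(
    propose: Callable[[str, List[str]], str],
    task: str,
    env: dict,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask `propose` for a program, retrying with accumulated error feedback."""
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        program = propose(task, errors)
        ok, feedback = execute(program, env)
        if ok:
            return program, attempt
        errors.append(feedback)
    return None, max_attempts


def toy_propose(task: str, errors: List[str]) -> str:
    """Hypothetical policy: the first draft has a bug, the second one fixes it."""
    if not errors:
        return "result.append(radius * 2 / 0)"
    return "result.append(radius * 2)"


env = {"result": [], "radius": 5}
program, attempts = self_correct(toy_propose, "compute the diameter", env)
```

Note that a loop like this only catches programs that raise. A script that runs cleanly but produces the wrong geometry needs a separate check, which is the gap between syntactic correctness and geometric accuracy that the abstract points to.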

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COM-as-Action is a reasonable idea for CAD control but the paper's claims rest on an unexamined assumption about COM interface availability.


The paper's main move is to treat COM calls as the primitive actions for an agent instead of GUI events or API calls. This leads to ComCADBench as a new test set for real CAD packages and to ComActor, which uses a three-stage self-correction loop plus the ComForge training platform.

What stands out is the direct comparison showing GUI agents near zero on the benchmark while COM-based runs improve immediately. The long-horizon resilience claim is the part that would matter most if it holds.

The soft spot is exactly the one in the stress-test note. The abstract gives no information on how the COM interfaces were discovered, whether they are documented, or how complete they are for the tasks in ComCADBench. If the authors had to reverse-engineer entry points or if large parts of the needed functionality are missing, the advantage over heterogeneous APIs shrinks to the same accessibility problem the paper criticizes. Without that evidence the central claim stays unverified.

The work is aimed at people building agents that must operate inside closed commercial desktop tools. A reader already working on computer-use agents for engineering software would find the benchmark and the COM framing worth looking at.

I would send it to peer review. The idea is concrete enough that referees can check the COM coverage and the experimental details directly.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes reframing professional software manipulation (focusing on industrial CAD) as COM-as-Action, treating the Component Object Model as a unified deterministic executable abstraction. It introduces ComCADBench as the first benchmark for agents on real CAD software, claims near-zero GUI success versus substantial COM gains for frontier models, and presents ComActor (a self-correcting agent via three-stage training) plus ComForge (a scalable Windows-container training platform). Experiments reportedly show SOTA performance on ComCADBench with resilience in long-horizon tasks and generalization to an external CAD benchmark.

Significance. If the core accessibility and performance claims hold with reproducible evidence, the work could meaningfully advance reliable agentic control of professional tools by shifting from fragile visual or heterogeneous API methods to programmatic synthesis. The new benchmark and training platform would constitute concrete contributions to the field.

major comments (1)

[Abstract] Abstract (paragraph 2): The central claim of a 'substantial paradigm gap' (near-zero GUI success vs. substantial COM gains) and the superiority of the COM-as-Action paradigm rests on the unverified assumption that COM supplies a unified, deterministic, and practically accessible executable abstraction for commercial CAD packages. No details are provided on interface discovery, documentation status, or completeness of exposed functionality for the specific CAD software, which is load-bearing; if access relies on undocumented or reverse-engineered entry points, the claimed advantage over API approaches collapses.
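The completeness question raised here can be checked mechanically once two inputs exist: the ground-truth scripts and a list of documented method names. The sketch below assumes both are available as plain Python data; `called_methods`, `coverage`, and every method name in the example are hypothetical and not taken from the paper or from any vendor documentation.

```python
import ast
from typing import Iterable, List, Set, Tuple


def called_methods(script: str) -> Set[str]:
    """Collect attribute names used as call targets, e.g. `doc.Extrude(5)` -> 'Extrude'."""
    names = set()
    for node in ast.walk(ast.parse(script)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            names.add(node.func.attr)
    return names


def coverage(scripts: Iterable[str], documented: Iterable[str]) -> Tuple[float, List[str]]:
    """Return (fraction of called methods that are documented, undocumented names)."""
    used: Set[str] = set()
    for script in scripts:
        used |= called_methods(script)
    undocumented = used - set(documented)
    fraction = 1.0 if not used else 1 - len(undocumented) / len(used)
    return fraction, sorted(undocumented)


# Hypothetical data for illustration only.
documented = {"NewDocument", "CreateLine", "Extrude"}
scripts = [
    "doc = app.NewDocument()\ndoc.CreateLine(0, 0, 1, 1)\ndoc.Extrude(5)",
    "doc.HiddenInternalCall()",
]
fraction, missing = coverage(scripts, documented)
```

This name-based check is deliberately crude: it ignores property reads and writes, and it does not verify which interface a method belongs to. A serious audit would resolve calls against the vendor's type library.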

minor comments (2)

[Abstract] Abstract: Typo 'API-basedapproaches' (missing space).
[Abstract] Abstract: The terms ComActor, ComCADBench, and ComForge are introduced without prior definition or citation to prior work, which may confuse readers unfamiliar with the contributions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback, which helps strengthen the clarity and rigor of our claims regarding the COM-as-Action paradigm. We address the single major comment point-by-point below.


Referee: [Abstract] Abstract (paragraph 2): The central claim of a 'substantial paradigm gap' (near-zero GUI success vs. substantial COM gains) and the superiority of the COM-as-Action paradigm rests on the unverified assumption that COM supplies a unified, deterministic, and practically accessible executable abstraction for commercial CAD packages. No details are provided on interface discovery, documentation status, or completeness of exposed functionality for the specific CAD software, which is load-bearing; if access relies on undocumented or reverse-engineered entry points, the claimed advantage over API approaches collapses.

Authors: We agree that additional transparency on COM interface access is warranted to support the paradigm claims. In the revised manuscript we will add a dedicated subsection (likely in Section 3 or 4) detailing: (1) the specific CAD packages used in ComCADBench and their official COM exposure via vendor-published type libraries; (2) the standard discovery mechanism using COM's ITypeLib/ITypeInfo interfaces and registry-based ProgID lookup, which does not rely on reverse engineering; (3) references to publicly available vendor documentation (e.g., SolidWorks and AutoCAD API references) confirming that the core geometric and modeling operations exercised by the benchmark are part of the documented, stable COM surface; and (4) a brief completeness analysis showing that the benchmark tasks map to documented methods rather than undocumented internals. This clarification will be reflected in an updated abstract paragraph as well. These additions directly address the load-bearing assumption without altering the experimental results.

revision: yes
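For readers unfamiliar with the registry-based ProgID lookup named in item (2), the sketch below shows one way to perform that single step with Python's standard `winreg` module. The function name `progid_to_clsid` is an assumption made for illustration, and the ProgID strings are examples that resolve only on a Windows machine with the corresponding software installed. Enumerating methods through ITypeLib/ITypeInfo is a separate step and is not shown.

```python
import sys
from typing import Optional


def progid_to_clsid(progid: str) -> Optional[str]:
    """Return the CLSID string registered for `progid`, or None if unavailable."""
    if sys.platform != "win32":
        return None  # The COM registry only exists on Windows.
    import winreg

    try:
        # HKEY_CLASSES_ROOT\<ProgID>\CLSID stores the class ID as its default value.
        return winreg.QueryValue(winreg.HKEY_CLASSES_ROOT, progid + r"\CLSID")
    except OSError:
        return None  # ProgID is not registered on this machine.


# Example ProgIDs (assumed for illustration; results depend on installed software).
for candidate in ("SldWorks.Application", "AutoCAD.Application"):
    print(candidate, progid_to_clsid(candidate))
```

A lookup that returns a CLSID shows only that an automation class is registered. It says nothing about how much of the application's functionality that class exposes, which is the part of the referee's concern that items (3) and (4) would have to answer.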

Circularity Check

0 steps flagged

No circularity: paradigm proposal and benchmark results are independent of inputs


The paper presents a conceptual reframing (COM-as-Action) and an empirical benchmark (ComCADBench) with agent experiments. No equations, fitted parameters, predictions derived from prior fits, or self-citation chains appear in the provided text. The core assumption that COM supplies a unified executable interface is stated as an identification rather than derived from any prior result or self-referential construction. Experimental claims (near-zero GUI success vs. COM gains) rest on reported benchmark outcomes, which are falsifiable externally and do not reduce to the assumption by definition. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Abstract-only review; no free parameters, mathematical axioms, or independently evidenced entities can be extracted. New named components are introduced but lack supporting details or external validation.

invented entities (3)

ComActor (no independent evidence)
purpose: self-correcting agent trained via three-stage framework for COM-based CAD interaction
Introduced in abstract as achieving SOTA with resilience in long-horizon tasks

ComCADBench (no independent evidence)
purpose: benchmark for agents operating real industrial CAD software
Claimed as first benchmark for validating the paradigm

ComForge (no independent evidence)
purpose: scalable platform for large-scale training in Windows containers
Developed to support agent training

pith-pipeline@v0.9.1-grok · 5781 in / 1446 out tokens · 42370 ms · 2026-07-01T07:43:16.821523+00:00 · methodology




work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Large Language Model-Brained GUI Agents: A Survey

Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large language model-brained gui agents: A survey.arXiv preprint arXiv:2411.18279, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Os agents: A survey on mllm-based agents for computer, phone and browser use

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, et al. Os agents: A survey on mllm-based agents for computer, phone and browser use. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7436–7465, 2025

2025

[20] [20]

Beyond browsing: Api-based web agents

Yueqi Song, Frank F Xu, Shuyan Zhou, and Graham Neubig. Beyond browsing: Api-based web agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11066–11085, 2025

2025

[21] [21]

Autowebglm: A large language model-based web navigating agent

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, et al. Autowebglm: A large language model-based web navigating agent. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5295–5306, 2024

2024

[22] [22]

Os-copilot: Towards generalist computer agents with self-improvement, 2024.URL https://arxiv

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024.URL https://arxiv. org/abs/2402.07456

work page arXiv 2024

[23] [23]

Executable code actions elicit better llm agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. InForty-first International Conference on Machine Learning, 2024

2024

[24] [24]

Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. Coact-1: Computer-using agents with coding as actions.arXiv preprint arXiv:2508.03923, 2025

work page arXiv 2025

[25] [25]

Sketchgraphs: A large-scale dataset for modeling relational geometry in computer-aided design.arXiv preprint arXiv:2007.08506, 2020

Ari Seff, Yaniv Ovadia, Wenda Zhou, and Ryan P Adams. Sketchgraphs: A large-scale dataset for modeling relational geometry in computer-aided design.arXiv preprint arXiv:2007.08506, 2020

work page arXiv 2007

[26] [26]

Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences.ACM Transactions on Graphics (TOG), 40(4):1–24, 2021

Karl DD Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G Lambourne, Armando Solar-Lezama, and Wojciech Matusik. Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences.ACM Transactions on Graphics (TOG), 40(4):1–24, 2021

2021

[27] [27]

Transcad: A hierarchical transformer for cad sequence inference from point clouds

Elona Dupont, Kseniya Cherenkova, Dimitrios Mallis, Gleb Gusev, Anis Kacem, and Djamila Aouada. Transcad: A hierarchical transformer for cad sequence inference from point clouds. InEuropean Conference on Computer Vision, pages 19–36. Springer, 2024

2024

[28] [28]

Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation

Jiahao Li, Weijian Ma, Xueyang Li, Yunzhong Lou, Guichun Zhou, and Xiangdong Zhou. Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18563–18573, 2025

2025

[29] [29]

Flexcad: Unified and versatile controllable cad generation with fine-tuned large language models

Zhanwei Zhang, Shizhao Sun, Wenxiao Wang, Deng Cai, and Jiang Bian. Flexcad: Unified and versatile controllable cad generation with fine-tuned large language models. InInternational Conference on Learning Representations, volume 2025, pages 3204–3227, 2025

2025

[30] [30]

Deepcad: A deep generative network for computer-aided design models

Rundi Wu, Chang Xiao, and Changxi Zheng. Deepcad: A deep generative network for computer-aided design models. InProceedings of the IEEE/CVF international conference on computer vision, pages 6772–6782, 2021

2021

[31] [31]

Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts.Advances in Neural Information Processing Systems, 37:7552–7579, 2024

Mohammad S Khan, Sankalp Sinha, Talha U Sheikh, Didier Stricker, Sk A Ali, and Muhammad Z Afzal. Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts.Advances in Neural Information Processing Systems, 37:7552–7579, 2024

2024

[32] [32]

Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward.Advances in Neural Information Processing Systems, 38:59765–59789, 2026

Yandong Guan, Xilin Wang, Ximing Xing, Jing Zhang, Dong Xu, and Qian Yu. Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward.Advances in Neural Information Processing Systems, 38:59765–59789, 2026

2026

[33] [33]

Cad-recode: Reverse engineering cad code from point clouds

Danila Rukhovich, Elona Dupont, Dimitrios Mallis, Kseniya Cherenkova, Anis Kacem, and Djamila Aouada. Cad-recode: Reverse engineering cad code from point clouds. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9801–9811, 2025

2025

[34] [34]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Cadmium: Fine-tuning code language models for text-driven sequential cad design.arXiv preprint arXiv:2507.09792, 2025

Prashant Govindarajan, Davide Baldelli, Jay Pathak, Quentin Fournier, and Sarath Chandar. Cadmium: Fine-tuning code language models for text-driven sequential cad design.arXiv preprint arXiv:2507.09792, 2025. 11

work page arXiv 2025

[36] [36]

Qwen Team. Qwen3.5. https://qwenlm.github.io/blog/qwen3.5/, February 2026. Accessed: 2026- 05-26

2026

[37] [37]

Claude Sonnet 4.6 system card

Anthropic. Claude Sonnet 4.6 system card. Technical report, Anthropic, February 2026

2026

[38] [38]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Cad-judge: Toward efficient morphological grading and verification for text-to-cad generation

Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, and Shuyou Zhang. Cad-judge: Toward efficient morphological grading and verification for text-to-cad generation. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1021–1025. IEEE, 2026

2026

[40] [40]

Gpt-4o system card

OpenAI. Gpt-4o system card. https://cdn.openai.com/gpt-4o-system-card.pdf , 2024. Accessed: 2024-09-26. 12 Appendix Contents A Data Construction 13 A.1 Source Data and Textualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 A.2 Ground Truth COM Script Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 13 A.3 Downstream Multi-Tas...

2024

[41] [41]

sketch_id

SFT, we apply Low-Rank Adaptation (LoRA) to all linear layers with a rank of r= 8 and α= 32 . The models are trained using the AdamW optimizer with a learning rate of 1e-5, a cosine learning rate scheduler, and a warmup ratio of 0.05. To accommodate the extensive context required for code generation and error tracebacks, the maximum sequence length is set...

work page arXiv 1961

[42] [42]

Brief reasoning inside<thinking>...</thinking>

[43] [43]

A high-level decision wrapped as:“‘decision CODE (or DONE/FAIL) “‘

[44] [44]

‘ RAG Prompt (Appended to Baseline) External Knowledge Context: Here are some COM APIs that might be useful for completing this task. [ {

If and only if the decision is CODE, output a single“‘python ... “‘block. Few-Shot Prompt (Appended to Baseline) Example: 3D Modeling in Solidworks Task Instruction:Model this part in Solidworks: To construct the first part of the cylinder...[Detailed dimensions and constraints omitted for brevity]...export the model as an STL and STEP file. Output: <thin...