Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback

Guijin Son; Jehyun Park; Seyeon Park; Sunghee Ahn; Youngjae Yu

arxiv: 2605.17448 · v2 · pith:JMGKQMSGnew · submitted 2026-05-17 · 💻 cs.GR · cs.CL

Pith reviewed 2026-05-19 22:36 UTC · model grok-4.3

classification 💻 cs.GR cs.CL

keywords CAD generation · finite element analysis · STEP files · LLM agents · self-improving loops · geometric reconstruction · engineering validation

The pith

CAD agents improve designs when finite element analysis and blueprint feedback close the loop between generation and engineering checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In the paper's main first-attempt sweep, current LLM agents for CAD produce no fully assembled multi-part file that passes strict validation. The paper reframes the task as generating complete STEP files from engineering briefs and then using finite element analysis to score them against physical requirements. It adds a text blueprint schema and a 21-view image renderer as extra signals so the agent can inspect and revise its own output. With these signals, Box-IoU for GPT-5.5/xhigh rises from 0.444 to 0.592 on S2O and from 0.397 to 0.505 on Fusion360. A reader cares because the result moves AI CAD tools from visual plausibility toward artifacts that are checked against structural constraints.

Core claim

The paper claims that finite element analysis on generated STEP files, paired with a novel text-only blueprint schema and 21-view image renderer, supplies usable feedback that lets Codex and Claude Code agents self-improve, lifting geometric reconstruction from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360 while moving toward higher rates of meeting typed engineering requirements.
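As a rough illustration of the metric behind these numbers, the sketch below computes intersection-over-union between two axis-aligned 3D bounding boxes, plus the relative size of a before/after lift. It assumes Box-IoU means bounding-box IoU, which the text available here does not define; the names `Box`, `box_iou`, and `relative_gain` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its min and max corners (x, y, z)."""
    lo: tuple[float, float, float]
    hi: tuple[float, float, float]

    def volume(self) -> float:
        v = 1.0
        for a, b in zip(self.lo, self.hi):
            v *= max(0.0, b - a)  # degenerate extents contribute zero volume
        return v


def box_iou(a: Box, b: Box) -> float:
    """Intersection volume divided by union volume of two axis-aligned boxes."""
    inter = 1.0
    for alo, ahi, blo, bhi in zip(a.lo, a.hi, b.lo, b.hi):
        # Overlap along one axis; zero if the boxes do not overlap on it.
        inter *= max(0.0, min(ahi, bhi) - max(alo, blo))
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0 else 0.0


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


if __name__ == "__main__":
    # Scores quoted in the core claim above.
    print(f"S2O: {relative_gain(0.444, 0.592):.1%}")
    print(f"Fusion360: {relative_gain(0.397, 0.505):.1%}")
```

Under that reading, the reported lifts are +0.148 absolute (about 33% relative) on S2O and +0.108 absolute (about 27% relative) on Fusion360.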

What carries the argument

The closed-loop agent that feeds finite element analysis results plus blueprint and multi-view image signals back into the next generation step to produce assembled multi-part STEP files.
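The control flow of such a loop can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's controller: `generate` and `evaluate` are hypothetical callables standing in for the agent and for the combined FEA, blueprint, and multi-view checks, and `Feedback` is an invented container.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Feedback:
    """Result of one evaluation round (hypothetical structure)."""
    passed: bool
    notes: str  # text handed back to the generator on the next round


def refine(
    brief: str,
    generate: Callable[[str, str], Any],
    evaluate: Callable[[Any], Feedback],
    max_rounds: int = 3,
) -> tuple[Any, list[Feedback]]:
    """Generate, evaluate, and regenerate until a pass or the round budget ends."""
    notes = ""  # the first attempt sees no feedback
    history: list[Feedback] = []
    artifact = None
    for _ in range(max_rounds):
        artifact = generate(brief, notes)
        feedback = evaluate(artifact)
        history.append(feedback)
        if feedback.passed:
            break
        notes = feedback.notes  # carry the failure report into the next attempt
    return artifact, history
```

Keeping the loop itself free of any design logic mirrors the split described in the Figure 2 caption, where the controller owns execution and validation and the agent owns design decisions.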

If this is right

No first-attempt agent run meets all strict requirements, but the added signals measurably raise the fraction of satisfied constraints.
Geometric reconstruction improves on both S2O and Fusion360 without changing the base model.
CAD generation becomes an iterative process checked against physical and structural criteria rather than reference proximity alone.
The same feedback loop can be applied to any agent that outputs STEP files for engineering review.
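The difference between a strict pass and the fraction of satisfied requirements can be made concrete with a small aggregation sketch. It assumes a strict pass means every typed requirement of a case is met, which is an interpretation rather than a definition quoted from the paper; the function names and the toy data are hypothetical.

```python
from statistics import mean


def fraction_satisfied(checks: list[bool]) -> float:
    """Share of a case's requirements that were met."""
    return sum(checks) / len(checks) if checks else 0.0


def strict_pass(checks: list[bool]) -> bool:
    """Assumed rule: a case passes strictly only if every requirement is met."""
    return bool(checks) and all(checks)


def summarize(cases: dict[str, list[bool]]) -> dict:
    """Aggregate per-case results into the two headline numbers."""
    fractions = [fraction_satisfied(c) for c in cases.values()]
    return {
        "strict_pass_count": sum(strict_pass(c) for c in cases.values()),
        "mean_fraction_satisfied": mean(fractions) if fractions else 0.0,
    }


if __name__ == "__main__":
    # Toy data, not results from the paper.
    toy = {
        "case_a": [True, False, False, False, False],
        "case_b": [True, True, False, False, False],
    }
    print(summarize(toy))
```

The two numbers can move independently: a sweep can raise the mean fraction of satisfied requirements while the strict-pass count stays at zero.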

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on additional simulation domains such as thermal or fluid analysis to see if the same loop generalizes.
Combining the blueprint and image signals with constraint solvers might further reduce the gap between generated files and production-ready parts.
Similar self-correction patterns may appear in other generative tasks that currently lack quantitative physical feedback.

Load-bearing premise

Finite element analysis performed on the generated STEP files gives a reliable enough signal of real engineering fitness.

What would settle it

Compare FEA-passing designs against either physical prototypes or higher-fidelity simulations to see whether the reported compliance gains disappear.

Figures

Figures reproduced from arXiv: 2605.17448 by Guijin Son, Jehyun Park, Seyeon Park, Sunghee Ahn, Youngjae Yu.

**Figure 2.** Overview of the CAD-agent pipeline. A free-form engineering brief is converted into an optional schema-v4 blueprint, decomposed into construction units, assembled into a STEP artifact by a deterministic controller, and revised using rich-view inspection and FEA feedback. The controller owns execution, measurement, composition, and validation, while the agent owns design decisions and CAD-code repair. …

**Figure 3.** Grouped nine-view sample for a generated wheel hub drawn from the 21-view rich-view set. The full set combines 12 axis-aligned and isometric views for exterior coverage, six close-ups for small features, and three alpha-blended x-ray views for internal mating and clearance. The strip contrasts conventional six-view coverage with selected additional views. The left close-up makes the bolt circle, concentric…

**Figure 4.** Representative S2O target items used to synthesize natural-language prompts. Adjacent text from Section C (Sample S2O and Fusion 360 evaluation prompts): For the geometric benchmarks, each evaluation prompt is generated from the target rendering and structured metadata rather than written directly by the authors. Figures 4 and 5 show representative target items, and the boxes below give the full corresponding natural-language prompts s…

**Figure 5.** Representative Fusion 360 target items used to synthesize natural-language prompts. Adjacent text from Section C.2 (Fusion 360 prompt samples), full generated prompt for the Fusion 360 robotic chassis: This is a fabricated steel electromechanical chassis consisting of numerous sheet-like and machined bodies assembled into a rigid open frame. The primary load-bearing members are two thick side plates with generous internal cutouts to reduce mas…

**Figure 6.** First-attempt quality versus one-step FEA repair gain on Hephaestus-CCX. Each point is a …

**Figure 7.** Strict-pass retries where the agent changes the physical load-bearing structure. The steel column becomes a braced four-chord box column, the HPVC roll-protection system sheds excess surrogate mass while preserving stiffness, and the UGV tool arm becomes a hollow box beam with cleaner root and tip selector faces. Adjacent text (Structural retuning): The AISC 360-22 steel column is the clearest load-path change and corres…

**Figure 8.** Strict-pass retries where the decisive change is simplification or hidden mass-property repair. The launcher removes fragile over-detailed geometry and routes load through a simpler body. The rollcage simplifies an unstable dense cage surrogate into an FIA-compatible tube layout with compliance metadata. The spacecraft panel looks similar in surface view, but the field map reveals the density correction th…

**Figure 9.** Strict-pass retries dominated by checker-contract repair. These artifacts already satisfy much of the underlying physics, but fail strict grading until the generated metadata exposes the required metric names, mass fields, selector bindings, or mesh-derived mass aliases. Adjacent text: … not the shape, but the bridge between the generated artifact and the evaluator’s typed engineering contract.

**Figure 10.** Per-item engineering-domain distribution across the 466-case candidate pool, with the raw domain field grouped into thirteen broad buckets. Each bar is split into single-part and multi-part segments; the right-of-bar number is the total count and percentage of the pool. Aerospace and ground-vehicle cases together account for over half the pool, but every bucket is represented in both subsets, which is wha…

**Figure 11.** Candidate-pool brief distribution by catalog, single-part vs multi-part. The intercollegiate catalog (i) and the foundational A-series (a) account for the largest share of the pool; engineering standards (s) and patents/datasheets (pt) provide the bulk of the strict-spec briefs. [Plot residue: axis "requirement count across the 466 briefs"; categories include structural analysis, vibration analysis, buckling analysis, unknown, …]

**Figure 12.** Distribution of requirement type across all pass/fail criteria in the 466-case pool. Structural-analysis criteria dominate; buckling, vibration, thermal, dimensional, geometric, and material-compliance checks each contribute a meaningful share. The two smallest types (fluid, radiation) sit outside CalculiX’s scope and are tracked as future-work analyses. Adjacent text: … end-to-end exercised by the 50 cases alone. (iii) D…

**Figure 13.** Catalog coverage of the curated 50-case benchmark against the full 466-case candidate pool. Bars are the pool count; red overlay is the count selected for Hephaestus-CCX. The selection over-samples engineering-standards (s) and aerospace (sa) catalogs because those briefs exercise the strictest pass/fail rubrics, and samples the I-series and A-series lightly relative to their pool share to keep the curate…

Original abstract

Computer-aided design (CAD) is the backbone of modern industrial design, yet learned CAD generators still fall short of real engineering pipelines: they neither iterate like engineers nor evaluate what engineering requires. Prior work has treated CAD generation as two disjoint steps, part synthesis and assembly, where the former is graded by proximity to a gold reference and the latter, when handled at all, is reduced to a separate constraint solving step. In this work, we introduce a more industry-native task formulation that requires a model to produce a fully assembled multi-part STEP file from a free-form engineering brief, which is then validated via finite element analysis (FEA). FEA validation reveals that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents do not produce a single strict-passing artifact in the main first-attempt sweep, with the best configuration meeting only about 20% of typed requirements on average. Moreover, we introduce two additional supervision signals, a novel text-only blueprint schema and a 21-view image renderer that aids the agent's visual inspection, that better align the generation loop with how engineers iterate in practice. On S2O and Fusion360, the same feedback tools improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU on S2O and from 0.397 to 0.505 on Fusion360. Together these signals move CAD programs toward artifacts that are not only visually plausible but also checked against physical and structural requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sets up a CAD generation task with FEA validation and adds blueprint plus multi-view feedback, but only reports geometric IoU gains without showing those translate to better FEA outcomes.

The main point is that this work pushes CAD agents toward producing assembled STEP files that can be checked for structural validity with finite element analysis, yet the quantitative results stay on geometry and leave the physical improvement claim untested. The authors define an end-to-end task from engineering brief to full multi-part model, run FEA on the output, and show that leading agents produce zero strict passes while meeting only about 20 percent of requirements on average. They then introduce a text blueprint schema and a 21-view renderer as extra signals in the loop, which lift Box-IoU on S2O and Fusion360 for the GPT-5.5 setup. That combination of task framing and specific feedback tools is the clearest new element.

The paper does a decent job of highlighting how current generators ignore engineering constraints and of sketching a more iterative, visually grounded process that matches how designers actually work. The baseline failure numbers are also straightforward and worth having in the literature.

The soft spot is the gap between the geometric lifts and any FEA or constraint-passing numbers. The abstract and results give before-and-after IoU but no corresponding before-and-after FEA scores, violation counts, or pass rates, so the claim that the new signals move designs toward real engineering requirements rests on an unshown correlation rather than direct evidence. More detail on dataset construction and exact agent prompting would also help, but the missing FEA link is the load-bearing gap.

The paper is aimed at groups working on generative design, agent loops, and simulation-in-the-loop training. Readers who want concrete ideas for blueprint representations or multi-view inspection could pick up useful pieces, while anyone focused on verified structural performance would need the extra metrics. It is worth sending to peer review because the task is timely and the failure case is clearly documented, even if the positive results need tightening.

Referee Report

2 major / 1 minor

Summary. The paper formulates CAD generation as producing fully assembled multi-part STEP files from free-form engineering briefs, with validation via finite element analysis (FEA). It reports that Codex (GPT-5.5) and Claude Code (Opus-4.7) agents produce no strict-passing artifacts in a first-attempt sweep, satisfying only ~20% of typed requirements on average. The authors introduce a text-only blueprint schema and 21-view image renderer as additional feedback signals; these yield Box-IoU gains from 0.444 to 0.592 on S2O and from 0.397 to 0.505 on Fusion360 for the GPT-5.5/xhigh configuration. The central thesis is that these signals, combined with FEA feedback, move outputs toward artifacts that satisfy real engineering requirements.

Significance. If the core premise holds, the work could meaningfully advance self-improving CAD agents by closing the gap between geometric plausibility and physical/structural validity. The task reformulation and explicit use of FEA as a feedback loop represent a concrete step beyond reference-based metrics; the reported agent failure rates and the two new supervision signals are useful empirical anchors for the field.

major comments (2)

[Abstract / Results] Abstract and results: the claim that the blueprint schema and 21-view renderer improve engineering fidelity rests on an untested correlation. Geometric Box-IoU lifts are quantified, yet no before/after FEA scores, constraint-violation counts, or change in the fraction of artifacts meeting typed requirements are reported; without these, the causal link between the new signals and satisfaction of physical requirements cannot be assessed.
[Evaluation] Evaluation protocol: the manuscript states that FEA validation reveals zero strict-passing artifacts and ~20% average requirement satisfaction, but provides no table or section detailing how FEA outputs are mapped to the typed requirements or how the feedback loop uses FEA scores to drive self-improvement iterations.

minor comments (1)

[Abstract] The abstract would benefit from a concise definition or example of the 'typed requirements' used in the 20% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. The points raised highlight opportunities to strengthen the empirical support for our claims and to clarify the evaluation protocol. We address each major comment below and commit to revisions that directly respond to the concerns.

Referee: [Abstract / Results] Abstract and results: the claim that the blueprint schema and 21-view renderer improve engineering fidelity rests on an untested correlation. Geometric Box-IoU lifts are quantified, yet no before/after FEA scores, constraint-violation counts, or change in the fraction of artifacts meeting typed requirements are reported; without these, the causal link between the new signals and satisfaction of physical requirements cannot be assessed.

Authors: We agree that the manuscript would benefit from direct before-and-after metrics on FEA outcomes and requirement satisfaction to substantiate the link to physical validity. The reported Box-IoU gains demonstrate improved geometric fidelity, which we view as a prerequisite for engineering requirements, but we did not quantify the corresponding changes in FEA pass rates or typed-requirement compliance for the blueprint and multi-view configurations. In the revised version we will re-evaluate the GPT-5.5/xhigh and Claude configurations with and without the new signals, reporting delta values for FEA scores, constraint-violation counts, and the fraction of artifacts meeting typed requirements. These additions will make the causal contribution of the supervision signals explicit. revision: yes
Referee: [Evaluation] Evaluation protocol: the manuscript states that FEA validation reveals zero strict-passing artifacts and ~20% average requirement satisfaction, but provides no table or section detailing how FEA outputs are mapped to the typed requirements or how the feedback loop uses FEA scores to drive self-improvement iterations.

Authors: We acknowledge that the current text describes the FEA integration at a high level without a dedicated mapping table or explicit iteration diagram. The manuscript does define the typed requirements and states that FEA is used for validation and feedback, yet the precise translation from FEA quantities (e.g., von Mises stress thresholds, displacement limits) to requirement satisfaction and the prompt-update mechanism for self-improvement are not tabulated. In revision we will insert a new subsection (with accompanying table and pseudocode) that (1) lists the FEA-derived criteria for each typed requirement and (2) details how the scalar FEA scores are injected into the agent’s next-turn prompt to close the self-improvement loop. revision: yes
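To make the promised mapping concrete, the sketch below checks FEA quantities against typed limits and renders the failures as text for the next prompt. It is a minimal sketch under stated assumptions and not the authors' protocol: the `Requirement` structure, the metric keys, the limit values, and the message format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One typed requirement tied to a single FEA-derived metric (hypothetical)."""
    name: str
    metric: str        # key expected in the FEA results dictionary
    limit: float
    kind: str = "max"  # "max": value must be <= limit; "min": value must be >= limit


def check(req: Requirement, fea: dict[str, float]) -> tuple[bool, str]:
    """Return (satisfied, message) for one requirement."""
    value = fea.get(req.metric)
    if value is None:
        # A metric that is never reported cannot be graded, so it counts as a failure.
        return False, f"{req.name}: metric '{req.metric}' not reported"
    ok = value <= req.limit if req.kind == "max" else value >= req.limit
    op = "<=" if req.kind == "max" else ">="
    status = "ok" if ok else "VIOLATED"
    return ok, f"{req.name}: {req.metric}={value:g}, required {op} {req.limit:g} [{status}]"


def feedback_text(reqs: list[Requirement], fea: dict[str, float]) -> str:
    """Collect only the failing requirements into text for the next agent turn."""
    lines = [msg for ok, msg in (check(r, fea) for r in reqs) if not ok]
    return "\n".join(lines) if lines else "All requirements satisfied."


if __name__ == "__main__":
    # Illustrative limits and results, not values from the paper.
    reqs = [
        Requirement("stress", "von_mises_mpa", 250.0),
        Requirement("deflection", "max_disp_mm", 2.0),
        Requirement("mode", "first_mode_hz", 50.0, kind="min"),
    ]
    print(feedback_text(reqs, {"von_mises_mpa": 310.0, "max_disp_mm": 1.5}))
```

Treating an unreported metric as a failure is a design choice in this sketch; it loosely echoes the Figure 9 caption, where artifacts fail strict grading until the generated metadata exposes the required metric names.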

Circularity Check

0 steps flagged

No circularity: empirical IoU gains reported from added feedback signals without any derivation or fit reducing to inputs.

The paper describes an empirical task formulation for CAD generation from engineering briefs, followed by FEA validation and introduction of blueprint and 21-view image feedback. Reported results consist of direct measurements: zero strict-passing artifacts in baseline sweeps, ~20% requirement compliance, and specific Box-IoU lifts (0.444 to 0.592 on S2O; 0.397 to 0.505 on Fusion360) when the new signals are added. No equations, parameter fittings, self-definitional loops, or load-bearing self-citations appear in the provided text that would make any claimed improvement equivalent to its own inputs by construction. The evaluation chain relies on external geometric and FEA metrics that remain independent of the generation process.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available, so the ledger is necessarily incomplete; the provided text states no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5820 in / 1293 out tokens · 46611 ms · 2026-05-19T22:36:46.130529+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: FEA validation reveals that Codex ... do not produce a single strict-passing artifact ... 21-view image renderer ... improve geometric reconstruction, with GPT-5.5/xhigh rising from 0.444 to 0.592 Box-IoU

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear
Paper passage: rich-view image judge renders the STEP from 21 calibrated views ... finite-element feedback from CalculiX

The relation tag describes the link between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.