
arxiv: 2605.15218 · v1 · pith:U7RVE3AUnew · submitted 2026-05-12 · 💻 cs.AI · cs.CE

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

Pith reviewed 2026-05-19 17:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CE
keywords agent harness, APDL automation, recovery policy, large language models, finite element simulation, MAPDL, reliability

The pith

A model-driven recovery policy inside a lightweight agent harness raises APDL automation completion rates above 92 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CAX-Agent as middleware that sits between an LLM service and an MAPDL solver to manage tool calls, maintain workflow state, and handle faults through an escalating recovery ladder. The evaluation runs three strategies—no recovery, rule-only patching, and model-driven regeneration—across 50 standard structural benchmarks with 450 total case-runs and blind human scoring. The model-only approach records the highest completion rate, task score, total score, and zero-intervention rate while producing large performance gaps over the baselines. These outcomes indicate that structured orchestration can turn inconsistent LLM outputs into reliable engineering automation.

Core claim

CAX-Agent organizes execution into three layers—LLM service, agent harness, and solver backend—with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. Evaluation of the three recovery strategies on 50 standard structural benchmarks shows that model_only achieves a 0.9267 completion rate, 3.59/4 task score, 9.16/10 total score, and 0.84 zero-intervention rate, outperforming rule_only and no_recovery with large effect sizes.

What carries the argument

The recovery ladder inside the agent harness that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention.
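
The escalation order described above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the rung names, the `run_with_ladder` function, the `RunResult` type, and the stub solver are all hypothetical, and the real harness presumably tracks far more state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical rung names mirroring the ladder order given in the text.
LADDER = ["rule_patch", "model_regeneration", "context_enrichment", "human_intervention"]

Execute = Callable[[str], Tuple[bool, str]]    # script -> (ok, error message)
Handler = Callable[[str, str], Optional[str]]  # (script, error) -> new script or None


@dataclass
class RunResult:
    ok: bool
    script: str
    rung: Optional[str] = None  # rung that produced the working script; None if none was needed
    log: List[str] = field(default_factory=list)


def run_with_ladder(script: str, execute: Execute, handlers: Dict[str, Handler],
                    ladder: List[str] = LADDER) -> RunResult:
    """Run the script; on failure, try each configured rung in order."""
    log: List[str] = []
    ok, error = execute(script)
    if ok:
        return RunResult(True, script, None, log)
    for rung in ladder:
        handler = handlers.get(rung)
        if handler is None:
            continue  # rung not enabled for this strategy
        log.append(f"{rung}: attempted")
        candidate = handler(script, error)
        if candidate is None:
            continue  # rung had nothing to offer, escalate
        ok, new_error = execute(candidate)
        if ok:
            return RunResult(True, candidate, rung, log)
        script, error = candidate, new_error  # carry the latest attempt upward
    return RunResult(False, script, None, log)


def fake_execute(script: str) -> Tuple[bool, str]:
    # Stand-in for a solver call: succeeds only if the script contains FINISH.
    if "FINISH" in script:
        return True, ""
    return False, "missing FINISH"


handlers: Dict[str, Handler] = {
    "rule_patch": lambda script, error: None,  # no rule matches in this toy case
    "model_regeneration": lambda script, error: script + "\nFINISH",
}
result = run_with_ladder("/PREP7", fake_execute, handlers)
print(result.ok, result.rung, result.log)
```

In this sketch a strategy is just the set of handlers passed in: an empty dictionary behaves like no_recovery, and a dictionary with only `rule_patch` behaves like rule_only.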

If this is right

  • Model_only recovery produces a 0.9267 completion rate and 9.16 total score on the evaluated benchmarks.
  • It reaches a 0.84 zero-intervention rate while the rule-only and no-recovery baselines stay at zero.
  • The performance gaps show large effect sizes with Cliff's delta between 0.81 and 0.87.
  • Human scoring maintains strong agreement with quadratic weighted Cohen's kappa of 0.84.
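
For readers unfamiliar with the effect-size measure cited in the list above, Cliff's delta is the share of cross-group pairs where the first group scores higher, minus the share where it scores lower. The sketch below uses the standard definition; the score lists are made-up illustration data, not values from the paper.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Made-up total scores for two strategies, for illustration only.
scores_a = [9, 10, 9, 8]
scores_b = [5, 6, 9, 4]
print(cliffs_delta(scores_a, scores_b))
```

The statistic ranges from -1 to 1 and depends only on the ordering of scores, which suits ordinal rubric data. By one commonly cited convention, absolute values above roughly 0.474 are read as large, so the reported 0.81 to 0.87 range sits well inside that band.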

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same recovery ladder works on complex geometries, agent harnesses could support production use of LLMs for structural simulations.
  • The three-layer separation of LLM, harness, and solver may apply to other engineering software that needs reliable tool orchestration.
  • Hybrid policies that blend rule patching with selective model calls could be tested next to improve results further.

Load-bearing premise

The performance differences seen on simple benchmark geometries will hold when the same recovery ladder is applied to more complex real-world geometries and loading conditions.

What would settle it

Re-running the three recovery strategies on a fresh set of complex geometries and loading conditions and finding that model_only loses its advantage in completion rate and scores.

Figures

Figures reproduced from arXiv: 2605.15218 by Chenying Lin, Haiyan Qiang, Liang Yu, Ran Wang, Yichen Hai, Yi He.

Figure 1: End-to-end UI example from a representative modal analysis run. The system autonomously generates the APDL script, executes it in MAPDL, and …
Figure 2: CAX-Agent runtime architecture showing the three-layer harness design with recovery policy selection and feedback loops.
Figure 3: Comparison of completion rate, scaled average total score, and zero …
Figure 4: Task-type breakdown over static, modal, and thermal subsets.
Figure 5: Interquartile score ranges and medians for each strategy. Model …
Figure 6: Mesh re-partition and re-meshing workflow. The visual pipeline …
Figure 7: Static pin-joint case (100 × 50 × 10 plate with a ϕ20 hole and ϕ20 pin under 2000 N lateral load). This example illustrates that failure is tied to geometry-meshing sensitivity rather than simply the number of parts in the assembly.
Original abstract

Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation. This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components -- the recovery policy. CAX-Agent organizes execution into three layers -- LLM service, agent harness, and solver backend -- with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen's kappa = 0.84, 96 percent of score pairs within one point). Model_only achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00) with large effect sizes (Cliff's delta = 0.81-0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
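
The run counts in the abstract can be checked with a few lines of arithmetic. The sketch below assumes each completion rate is a simple fraction of the 150 case-runs per strategy (50 benchmarks times 3 repeats); the implied success counts are an inference from the rounded rates, not figures stated in the source.

```python
BENCHMARKS = 50
STRATEGIES = ("no_recovery", "rule_only", "model_only")
REPEATS = 3

total_case_runs = BENCHMARKS * len(STRATEGIES) * REPEATS
runs_per_strategy = BENCHMARKS * REPEATS

# Completion rates as reported in the abstract.
reported = {"no_recovery": 0.6933, "rule_only": 0.7733, "model_only": 0.9267}

# Inferred (not reported) number of completed case-runs per strategy.
implied_counts = {name: round(rate * runs_per_strategy) for name, rate in reported.items()}

print(total_case_runs, runs_per_strategy)
for name, count in implied_counts.items():
    print(name, count, round(count / runs_per_strategy, 4))
```

Under that assumption the rates are consistent with 104, 116, and 139 completed runs out of 150, so the gap between model_only and rule_only would amount to 23 additional completed case-runs.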

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CAX-Agent, a lightweight agent harness for reliable APDL (MAPDL) finite-element simulation automation using LLMs. It describes a three-layer architecture (LLM service, agent harness, solver backend) incorporating a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. The central empirical contribution is a controlled evaluation of three recovery strategies (no_recovery, rule_only, model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent raters perform blind scoring with strong inter-rater agreement (quadratic weighted Cohen's kappa = 0.84). Model_only achieves the highest completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming the baselines with large effect sizes (Cliff's delta = 0.81-0.87). The benchmarks employ deliberately simple geometries to isolate recovery effects, with explicit discussion of scope limits and directions for broader validation.

Significance. If the observed superiority of model-driven recovery generalizes, the work offers a practical middleware approach to improve consistency and fault tolerance in LLM-driven engineering simulation workflows. Strengths include the repeated-run design, blind human evaluation, high inter-rater reliability, quantitative metrics with effect sizes, and transparent acknowledgment of the simple-geometry scope. These elements provide a solid foundation for the empirical claim on the tested regime.
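
The inter-rater statistic mentioned among the strengths, quadratic weighted Cohen's kappa, penalizes disagreements by the squared distance between the two scores. The sketch below implements the standard formula from scratch; the 0 to 4 score range and the rating lists are assumptions chosen for illustration, since the rubric and raw ratings are not given here.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """Cohen's kappa with weights w_ij = (i - j)^2 / (k - 1)^2 over k ordered categories."""
    k = max_score - min_score + 1
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        return 1.0  # both raters used a single identical category
    return 1.0 - weighted_observed / weighted_expected


def within_one_point(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of score pairs that differ by at most one point."""
    return sum(1 for x, y in zip(a, b) if abs(x - y) <= 1) / len(a)


# Made-up ratings on an assumed 0-4 scale, for illustration only.
rater_1 = [1, 2, 3, 4, 2, 3]
rater_2 = [1, 2, 3, 4, 3, 3]
kappa = quadratic_weighted_kappa(rater_1, rater_2, 0, 4)
print(round(kappa, 4), within_one_point(rater_1, rater_2))
```

A single one-point disagreement barely lowers the statistic, which is why a kappa of 0.84 can coexist with 96 percent of pairs falling within one point of each other.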

major comments (1)
  1. [Abstract and Evaluation section] The central claim that model_only provides reliable APDL automation rests on performance differences observed exclusively with 50 deliberately simple structural benchmarks chosen to isolate recovery-policy effects. While the abstract and discussion appropriately flag this scope limit, the manuscript would benefit from a more explicit argument or preliminary data showing that the performance ordering is robust when error modes shift under nonlinear materials, contact definitions, multi-step loading, or larger meshes (as these alter script complexity and regeneration opportunities). This directly affects the strength of the broader assertion beyond the controlled setting.
minor comments (2)
  1. [Abstract] The abstract refers to 'standard structural benchmarks' without specifying selection criteria, sources, or exact characteristics of the 50 cases; adding this detail (e.g., in a dedicated benchmark subsection) would strengthen reproducibility.
  2. [Evaluation] The scoring rubric for the task score (out of 4) and total score (out of 10) is not detailed in the provided abstract; including the precise criteria used by raters would aid interpretation of the reported means.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on generalizability and for the overall positive evaluation. We address the major comment below and have revised the manuscript to strengthen the relevant discussion.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central claim that model_only provides reliable APDL automation rests on performance differences observed exclusively with 50 deliberately simple structural benchmarks chosen to isolate recovery-policy effects. While the abstract and discussion appropriately flag this scope limit, the manuscript would benefit from a more explicit argument or preliminary data showing that the performance ordering is robust when error modes shift under nonlinear materials, contact definitions, multi-step loading, or larger meshes (as these alter script complexity and regeneration opportunities). This directly affects the strength of the broader assertion beyond the controlled setting.

    Authors: We agree that an explicit argument for why the observed performance ordering may hold under shifted error modes would strengthen the paper. The original design intentionally used simple geometries to isolate recovery-policy effects from confounding factors such as mesh size or material nonlinearity, as already noted in the abstract and discussion. In the revised manuscript we have expanded the Discussion section with a more detailed argument: the dominant failure modes addressed by model-driven recovery (syntax errors, missing parameters, and inconsistent command sequences) remain prevalent when nonlinear materials, contacts, or multi-step loads are introduced, and the regeneration and context-enrichment steps operate at the script level rather than depending on geometric simplicity. Larger meshes primarily increase solve time rather than altering the scripting error distribution targeted by the harness. We have also clarified the planned validation steps for these regimes. We do not add new empirical data in this revision, as the controlled scope was fixed by the study design; the added argument is therefore the primary change. revision: yes
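
To make the phrase "inconsistent command sequences" concrete, here is one deterministic patch of the kind a rule-only policy might apply. It is a hypothetical rule written for illustration, not one taken from the paper: it inserts a missing `FINISH` before a script enters a new processor (`/PREP7`, `/SOLU`, `/POST1`, `/POST26`). It matches only these exact spellings and ignores APDL command abbreviations.

```python
PROCESSORS = {"/PREP7", "/SOLU", "/POST1", "/POST26"}


def patch_missing_finish(script: str) -> str:
    """Insert FINISH before a processor command when the previous processor is still open."""
    patched = []
    in_processor = False
    for line in script.splitlines():
        # Drop trailing "!" comments, then take the command name before the first comma.
        command = line.split("!")[0].strip().split(",")[0].upper()
        if command in PROCESSORS:
            if in_processor:
                patched.append("FINISH")
            in_processor = True
        elif command == "FINISH":
            in_processor = False
        patched.append(line)
    return "\n".join(patched)


# Toy script: /SOLU is entered without leaving /PREP7 first.
broken = "\n".join([
    "/PREP7",
    "ET,1,SOLID185",
    "/SOLU",
    "SOLVE",
    "FINISH",
    "/POST1",
])
fixed = patch_missing_finish(broken)
print(fixed)
```

A rule like this is cheap and predictable but only covers the patterns someone thought to encode, which is one plausible reading of why the rule-only strategy trails model-driven regeneration in the reported results.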

Circularity Check

0 steps flagged

No circularity: empirical performance comparison is self-contained

Full rationale

The paper reports direct experimental results from running three recovery strategies (no_recovery, rule_only, model_only) across 450 case-runs on 50 simple structural benchmarks. Completion rates, task scores, total scores, zero-intervention rates, and Cliff's deltas are measured outcomes scored by blind human raters with reported inter-rater agreement. No equations, parameter fits, derivations, or self-citations are present that would reduce these measured quantities to prior inputs by construction. The central claim is an observed ordering on the chosen benchmark set; the paper explicitly notes the deliberate simplicity of the geometries and does not derive or predict results outside that regime. This is a standard empirical evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new physical entities are introduced. The work rests on standard assumptions about LLM tool-calling reliability and human rater consistency, none of which are formalized as axioms in the abstract.

pith-pipeline@v0.9.0 · 5866 in / 1361 out tokens · 52871 ms · 2026-05-19T17:35:23.801661+00:00 · methodology


