LLM-ADAM: A Generalizable LLM Agent Framework for Pre-Print Anomaly Detection in Additive Manufacturing
Pith reviewed 2026-05-07 02:25 UTC · model grok-4.3
The pith
LLM-ADAM decomposes G-code anomaly detection into Extractor, Reference, and Judge LLM roles and reaches 87.5% accuracy on a 200-file FFF corpus, outperforming a single-LLM baseline of 59.5%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Structured decomposition into Extractor-LLM, Reference-LLM, and Judge-LLM roles, rather than backbone strength alone, is the dominant source of improvement, yielding 87.5% accuracy on an N=200 FFF G-code corpus spanning two printers, two materials, and five classes.
Load-bearing premise
That the three LLMs will reliably produce a correct structured schema, accurate operating ranges, and a faithful deviation interpretation when given only the G-code text and documentation, without systematic extraction or reasoning errors that vary with model version or prompt phrasing.
Original abstract
Additive manufacturing (AM) continues to transform modern manufacturing by enabling flexible, on-demand production of complex geometries across diverse industries. Fused filament fabrication (FFF) has extended AM to laboratories, classrooms, and small production environments, but this accessibility shifts process-planning responsibility to users who may lack manufacturing expertise. A syntactically valid slicer profile can still encode thermally or geometrically harmful settings, and subtle G-code edits can alter extrusion, cooling, or adhesion before a print begins. Pre-print G-code screening catches accidental or adversarial machine-program errors before material or machine time is wasted. This paper proposes LLM-ADAM as a generalizable LLM framework for pre-print anomaly detection in AM. The framework decomposes the task into three roles: Extractor-LLM maps a G-code file to a structured process-parameter schema; Reference-LLM converts printer and material documentation into aligned operating ranges; and Judge-LLM interprets a deterministic deviation table and G-code evidence to decide whether a part is non-defective or belongs to an anomaly class. Printers, materials, and LLM backbones are interchangeable test conditions, not fixed assumptions. We evaluate the framework on an N=200 FFF G-code corpus spanning two desktop printer families, two materials, and five classes including non-defective, under-extrusion, over-extrusion, warping, and stringing. The best framework configuration reaches 87.5% accuracy, compared with 59.5% for the strongest engineered single-LLM baseline. The results show that structured decomposition, rather than backbone strength alone, is the dominant source of improvement, with defect classes identified at or near ceiling for leading configurations while residual errors concentrate on conservative false alarms for non-defective samples.
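The abstract says the Judge-LLM receives a deterministic deviation table but does not show its format. A minimal sketch of how such a table could be computed is below, assuming the Extractor output is a flat mapping of parameter names to numbers and the Reference output is a mapping of the same names to closed ranges. The parameter names, values, and ranges are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed operating range [low, high] for one process parameter."""
    low: float
    high: float


def deviation_table(extracted: dict[str, float],
                    reference: dict[str, Range]) -> list[dict]:
    """Compare each extracted parameter with its reference range.

    Returns one row per extracted parameter, sorted by name, with a
    status of 'below', 'within', 'above', or 'no_reference'.
    """
    rows = []
    for name, value in sorted(extracted.items()):
        rng = reference.get(name)
        if rng is None:
            # No documented range: flag it instead of guessing.
            status, deviation = "no_reference", None
        elif value < rng.low:
            status, deviation = "below", value - rng.low
        elif value > rng.high:
            status, deviation = "above", value - rng.high
        else:
            status, deviation = "within", 0.0
        rows.append({"parameter": name, "value": value,
                     "status": status, "deviation": deviation})
    return rows


# Hypothetical inputs, for illustration only.
extracted = {"nozzle_temp_c": 185.0, "bed_temp_c": 60.0, "fan_speed_pct": 100.0}
reference = {"nozzle_temp_c": Range(190.0, 220.0), "bed_temp_c": Range(50.0, 70.0)}

table = deviation_table(extracted, reference)
for row in table:
    print(row)
```

Because this step involves no model call, the same inputs always yield the same table, which is presumably what "deterministic" refers to in the abstract.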
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLM-ADAM, a three-role LLM agent framework (Extractor-LLM, Reference-LLM, Judge-LLM) for pre-print anomaly detection in FFF G-code. On an N=200 corpus spanning two printers, two materials, and five classes, the best configuration reaches 87.5% accuracy, compared with 59.5% for the strongest single-LLM baseline; the authors attribute the 28-point gain primarily to the structured decomposition rather than backbone strength.
Significance. If the attribution to role decomposition holds under controlled prompting and schema disclosure, the result would demonstrate a practical, modular way to apply LLMs to manufacturing process validation without requiring domain-specific fine-tuning. The interchangeability of printers, materials, and backbones is a positive design choice that supports generalizability claims.
major comments (2)
- [Abstract] Abstract: the central claim that 'structured decomposition, rather than backbone strength alone, is the dominant source of improvement' cannot be evaluated because neither the exact single-LLM baseline prompt template nor the JSON schemas and deterministic deviation table produced by the three-role pipeline are supplied. Without these artifacts it is impossible to distinguish architectural gain from richer multi-turn scaffolding.
- [Abstract] Abstract: the reported 87.5% accuracy and per-class ceiling performance are given without data-split details, prompt-variation ablations, or error analysis stratified by printer/material; these omissions make it impossible to assess robustness to model updates or prompt phrasing, which the weakest-assumption note identifies as load-bearing.
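To give a rough sense of the sampling uncertainty behind the two headline numbers, the sketch below computes 95% Wilson score intervals. It assumes both accuracies were measured on all 200 files, which the abstract does not confirm; if a smaller test split was used, the intervals would be wider.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


N = 200  # assumption: every file in the corpus is scored
framework_correct = round(0.875 * N)  # files classified correctly by the framework
baseline_correct = round(0.595 * N)   # files classified correctly by the baseline

for label, k in [("framework", framework_correct), ("baseline", baseline_correct)]:
    lo, hi = wilson_interval(k, N)
    print(f"{label}: {k}/{N} correct, 95% interval [{lo:.3f}, {hi:.3f}]")
```

Under that assumption the two intervals do not overlap, so the gap is unlikely to be sampling noise alone. The interval says nothing about sensitivity to prompt phrasing or model version, which is the concern raised above.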
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. Both major comments correctly identify that the abstract (and, by extension, the current manuscript) does not supply the concrete artifacts and experimental controls needed to isolate the contribution of role decomposition. We will address these omissions with targeted additions rather than by altering the core claims.
- Referee: [Abstract] Abstract: the central claim that 'structured decomposition, rather than backbone strength alone, is the dominant source of improvement' cannot be evaluated because neither the exact single-LLM baseline prompt template nor the JSON schemas and deterministic deviation table produced by the three-role pipeline are supplied. Without these artifacts it is impossible to distinguish architectural gain from richer multi-turn scaffolding.
Authors: We agree. In the revised manuscript we will add an appendix that contains (i) the exact single-LLM baseline prompt template used for the 59.5% result, (ii) the JSON schema emitted by the Extractor-LLM, (iii) the Reference-LLM output format, and (iv) the deterministic deviation table passed to the Judge-LLM. With these artifacts, readers can reproduce the multi-turn scaffolding and verify that the performance gap is not merely an artifact of prompt length or formatting. revision: yes
- Referee: [Abstract] Abstract: the reported 87.5% accuracy and per-class ceiling performance are given without data-split details, prompt-variation ablations, or error analysis stratified by printer/material; these omissions make it impossible to assess robustness to model updates or prompt phrasing, which the weakest-assumption note identifies as load-bearing.
Authors: We accept the criticism. The revision will include: (a) explicit train/test split ratios and randomization seed for the N=200 corpus, (b) a prompt-variation ablation table (temperature, few-shot count, and schema phrasing), and (c) a confusion-matrix breakdown stratified by printer family and material. These additions will be placed in the experimental section and will directly support the robustness statements currently only asserted in the abstract. revision: yes
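The stratified breakdown promised in (c) could be tabulated along the following lines. This is a sketch only: the record layout, the printer and material identifiers, and the example rows are hypothetical, and only the class labels come from the abstract.

```python
from collections import Counter, defaultdict


def stratified_confusion(records):
    """Build one confusion matrix per (printer, material) stratum.

    records: iterable of (printer, material, true_label, predicted_label).
    Each matrix is a Counter keyed by (true_label, predicted_label).
    """
    matrices = defaultdict(Counter)
    for printer, material, truth, predicted in records:
        matrices[(printer, material)][(truth, predicted)] += 1
    return matrices


def accuracy(matrix: Counter) -> float:
    """Fraction of samples on the diagonal of one confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (truth, predicted), n in matrix.items() if truth == predicted)
    return correct / total if total else float("nan")


# Hypothetical predictions, for illustration only.
records = [
    ("printer_A", "material_1", "non-defective", "warping"),  # a false alarm
    ("printer_A", "material_1", "warping", "warping"),
    ("printer_B", "material_2", "stringing", "stringing"),
    ("printer_B", "material_2", "non-defective", "non-defective"),
]

matrices = stratified_confusion(records)
for stratum, matrix in sorted(matrices.items()):
    print(stratum, f"accuracy={accuracy(matrix):.2f}", dict(matrix))
```

Keeping the off-diagonal cells per stratum would show whether the conservative false alarms on non-defective samples mentioned in the abstract concentrate on one printer or material.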
Circularity Check
No circularity: empirical accuracy measured on held-out corpus
The abstract reports a direct empirical comparison (87.5% vs. 59.5%) between the three-role pipeline and a single-LLM baseline on the same N=200 held-out G-code files. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked; the claimed improvement is therefore not reducible to any input by construction. The evaluation remains falsifiable by re-running the identical prompts and schemas on the released corpus.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: G-code text plus documentation suffice for deterministic extraction of all relevant process parameters and ranges
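One way to probe this assumption is to cross-check an LLM extraction against values that can be read directly from the command stream. The sketch below is not the paper's Extractor; it assumes Marlin-style G-code, where M104/M109 set the hotend temperature, M140/M190 set the bed temperature, and M106 sets fan speed as an S value from 0 to 255. The output field names and the sample snippet are hypothetical.

```python
import re

# Marlin-style commands mapped to hypothetical output field names.
COMMANDS = {
    "M104": "nozzle_temp_c", "M109": "nozzle_temp_c",
    "M140": "bed_temp_c", "M190": "bed_temp_c",
    "M106": "fan_pwm",
}

# Command word at the start of the line, then an S parameter before any ';' comment.
LINE_RE = re.compile(r"^\s*(M\d+)\b[^;]*?\bS(-?\d+(?:\.\d+)?)")


def extract_setpoints(gcode: str) -> dict[str, list[float]]:
    """Collect every S setpoint, in file order, for the commands listed above."""
    found: dict[str, list[float]] = {}
    for line in gcode.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(1) in COMMANDS:
            field = COMMANDS[match.group(1)]
            found.setdefault(field, []).append(float(match.group(2)))
    return found


# Hypothetical snippet, for illustration only.
sample = """\
; M104 S999 in a comment line is ignored
M140 S60 ; heat bed
M104 S210
G28
M190 S60
M109 S210
G1 X10 Y10 E0.5 F1500
M106 S255
"""

setpoints = extract_setpoints(sample)
print(setpoints)
```

A check like this covers only explicit setpoints. Parameters that are implicit in the motion commands, such as effective extrusion rate, may not be recoverable this simply, which is one reason the assumption deserves scrutiny.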