LLM-ADAM: A Generalizable LLM Agent Framework for Pre-Print Anomaly Detection in Additive Manufacturing

Ahmadreza Eslaminia; Cameron Smith; Chenhui Shao; Chuhan Cai; Klara Nahrstedt; Rajiv Malhotra; Ruo-Syuan Mei; Shichen Li

arxiv: 2605.03328 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.AI

Ahmadreza Eslaminia, Chuhan Cai, Cameron Smith, Ruo-Syuan Mei, Shichen Li, Rajiv Malhotra, Klara Nahrstedt, Chenhui Shao

Pith reviewed 2026-05-07 02:25 UTC · model grok-4.3

keywords: framework, g-code, manufacturing, anomaly, non-defective, pre-print, additive, before

The pith

LLM-ADAM decomposes G-code anomaly detection into Extractor, Reference, and Judge LLM roles and reaches 87.5% accuracy on a 200-file FFF corpus, outperforming a single-LLM baseline of 59.5%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors split the detection task into three specialized LLM calls. One LLM reads the G-code and extracts process parameters such as temperatures and speeds into a fixed schema. A second LLM reads printer and material manuals and turns them into allowable ranges for those same parameters. A third LLM receives a simple table of deviations plus excerpts from the G-code and decides whether the file is safe or belongs to one of four defect classes. The framework was tested on 200 G-code files from two desktop printers and two materials. The best three-role configuration achieved 87.5 percent accuracy while the strongest single-LLM prompt reached only 59.5 percent. Errors that remained were mostly conservative false alarms on good files rather than missed defects. The authors emphasize that swapping printers, materials, or the underlying LLM model requires only new documentation and no retraining.
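The hand-off between the second and third roles can be pictured with a small sketch. The code below is not the authors' implementation: the parameter names, the `Range` type, and the example values are all hypothetical, and it only illustrates how extracted values and documented ranges might be joined into a deviation table without any LLM call.

```python
from dataclasses import dataclass


@dataclass
class Range:
    """Allowable operating range for one parameter (hypothetical type)."""
    low: float
    high: float


def deviation_table(extracted: dict[str, float],
                    reference: dict[str, Range]) -> list[dict]:
    """Compare extracted values against reference ranges, one row per parameter."""
    rows = []
    for name, value in sorted(extracted.items()):
        rng = reference.get(name)
        if rng is None:
            # The documentation gave no range for this parameter.
            status, delta = "no_reference", None
        elif value < rng.low:
            status, delta = "below", value - rng.low
        elif value > rng.high:
            status, delta = "above", value - rng.high
        else:
            status, delta = "within", 0.0
        rows.append({"parameter": name, "value": value,
                     "status": status, "delta": delta})
    return rows


# Hypothetical inputs standing in for Extractor-LLM and Reference-LLM output.
extracted = {"nozzle_temp_c": 250.0, "bed_temp_c": 60.0, "fan_percent": 100.0}
reference = {"nozzle_temp_c": Range(190.0, 220.0),
             "bed_temp_c": Range(50.0, 70.0)}

table = deviation_table(extracted, reference)
for row in table:
    print(row)
```

Keeping this step in ordinary code means the third role would see the same table for the same inputs, so any run-to-run variation would be confined to the LLM stages on either side of it.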

Core claim

Structured decomposition into Extractor-LLM, Reference-LLM, and Judge-LLM roles, rather than backbone strength alone, is the dominant source of improvement, yielding 87.5% accuracy on an N=200 FFF G-code corpus spanning two printers, two materials, and five classes.

Load-bearing premise

That the three LLMs will reliably produce a correct structured schema, accurate operating ranges, and a faithful deviation interpretation when given only the G-code text and documentation, without systematic extraction or reasoning errors that vary with model version or prompt phrasing.

Original abstract

Additive manufacturing (AM) continues to transform modern manufacturing by enabling flexible, on-demand production of complex geometries across diverse industries. Fused filament fabrication (FFF) has extended AM to laboratories, classrooms, and small production environments, but this accessibility shifts process-planning responsibility to users who may lack manufacturing expertise. A syntactically valid slicer profile can still encode thermally or geometrically harmful settings, and subtle G-code edits can alter extrusion, cooling, or adhesion before a print begins. Pre-print G-code screening catches accidental or adversarial machine-program errors before material or machine time is wasted. This paper proposes LLM-ADAM as a generalizable LLM framework for pre-print anomaly detection in AM. The framework decomposes the task into three roles: Extractor-LLM maps a G-code file to a structured process-parameter schema; Reference-LLM converts printer and material documentation into aligned operating ranges; and Judge-LLM interprets a deterministic deviation table and G-code evidence to decide whether a part is non-defective or belongs to an anomaly class. Printers, materials, and LLM backbones are interchangeable test conditions, not fixed assumptions. We evaluate the framework on an N=200 FFF G-code corpus spanning two desktop printer families, two materials, and five classes including non-defective, under-extrusion, over-extrusion, warping, and stringing. The best framework configuration reaches 87.5% accuracy, compared with 59.5% for the strongest engineered single-LLM baseline. The results show that structured decomposition, rather than backbone strength alone, is the dominant source of improvement, with defect classes identified at or near ceiling for leading configurations while residual errors concentrate on conservative false alarms for non-defective samples.
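To make "a structured process-parameter schema" concrete, the sketch below uses a deterministic regular-expression pass as a stand-in for the Extractor-LLM. This is an assumption for illustration only: the paper uses an LLM for this step, the field names are hypothetical, and the patterns assume Marlin-style commands (`M104`/`M109` for nozzle temperature, `M140`/`M190` for bed temperature, `M106` for fan PWM, `F` words on `G0`/`G1` moves for feedrate).

```python
import re

# Hypothetical schema fields mapped to Marlin-style command patterns.
_NUMBER = r"(\d+(?:\.\d+)?)"
_PATTERNS = {
    "nozzle_temp_c": re.compile(r"^M10[49]\b.*?\bS" + _NUMBER),
    "bed_temp_c": re.compile(r"^M1[49]0\b.*?\bS" + _NUMBER),
    "fan_pwm": re.compile(r"^M106\b.*?\bS" + _NUMBER),
    "max_feedrate_mm_min": re.compile(r"^G[01]\b.*?\bF" + _NUMBER),
}


def extract_parameters(gcode: str) -> dict[str, float]:
    """Return the maximum value observed for each schema field."""
    found: dict[str, float] = {}
    for raw_line in gcode.splitlines():
        # Drop ';' comments and surrounding whitespace.
        line = raw_line.split(";", 1)[0].strip()
        if not line:
            continue
        for field, pattern in _PATTERNS.items():
            match = pattern.match(line)
            if match:
                value = float(match.group(1))
                found[field] = max(found.get(field, value), value)
    return found


sample = """\
M140 S60 ; heat bed
M104 S215
G28
M106 S255
G1 X10 Y10 E0.5 F1800
G1 X20 Y10 E1.0 F2400
M104 S0 ; cool down
"""

params = extract_parameters(sample)
print(params)
```

Taking the maximum per field is a simplification that ignores the end-of-print `M104 S0` cool-down; a real extractor would also need per-layer or per-region values, which is presumably part of what motivates an LLM for this role.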

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLM-ADAM shows that a three-role agent can lift G-code anomaly detection from 59.5% to 87.5% on a 200-file multi-printer corpus, but the abstract gives no baseline prompt or schema details, so the attribution to role decomposition remains unverified.

The main takeaway is straightforward: the authors split the task into Extractor, Reference, and Judge LLMs and report a clear accuracy jump on held-out G-code from two printers and two materials. That decomposition is a concrete, previously unreported application for pre-print screening, and the numbers are given for five classes including non-defective prints. The setup treats printers, materials, and backbones as interchangeable, which is the right framing for a practical tool. What the abstract does not show is the exact single-LLM prompt used for the 59.5% baseline or the JSON schemas and deviation tables actually produced by the first two stages. Without those, it is impossible to tell whether the 28-percentage-point gain comes from the role split or simply from richer multi-turn scaffolding that could be collapsed into one call. The corpus is modest, residual errors are described as conservative false alarms on good parts, and no prompt-sensitivity or model-version checks are mentioned. The work is therefore a useful early demonstration rather than a settled result. It is aimed at groups already running LLM agents on manufacturing data or building lightweight quality checks for desktop FFF users. A serious referee should see the full methods, the baseline templates, and the raw outputs before deciding whether the architectural claim holds. I would send it to review once those pieces are supplied.
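How much a 200-file corpus can support is easy to gauge with a confidence interval. The sketch below computes 95% Wilson score intervals for the two reported accuracies, assuming each was computed over all 200 files (175 and 119 correct, respectively) and treating files as independent. Neither assumption is confirmed by the abstract, and both systems were scored on the same files, so a paired test such as McNemar's would be more appropriate once per-file predictions are available.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Counts implied by the reported accuracies, assuming N = 200 for both.
framework = wilson_interval(175, 200)   # 87.5% reported
baseline = wilson_interval(119, 200)    # 59.5% reported

print("framework:", framework)
print("baseline: ", baseline)
print("intervals overlap:", framework[0] <= baseline[1])
```

Under these assumptions the two intervals do not overlap, so sampling noise alone is unlikely to explain the gap; the open question raised in the letter is what causes it, not whether it exists.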

Referee Report

2 major / 0 minor

Summary. The paper introduces LLM-ADAM, a three-role LLM agent framework (Extractor-LLM, Reference-LLM, Judge-LLM) for pre-print anomaly detection in FFF G-code. On an N=200 corpus spanning two printers, two materials, and five classes, the best configuration reaches 87.5% accuracy, compared with 59.5% for the strongest single-LLM baseline; the authors attribute the 28-point gain primarily to the structured decomposition rather than backbone strength.

Significance. If the attribution to role decomposition holds under controlled prompting and schema disclosure, the result would demonstrate a practical, modular way to apply LLMs to manufacturing process validation without requiring domain-specific fine-tuning. The interchangeability of printers, materials, and backbones is a positive design choice that supports generalizability claims.

major comments (2)

[Abstract] The central claim that 'structured decomposition, rather than backbone strength alone, is the dominant source of improvement' cannot be evaluated because neither the exact single-LLM baseline prompt template nor the JSON schemas and deterministic deviation table produced by the three-role pipeline are supplied. Without these artifacts it is impossible to distinguish architectural gain from richer multi-turn scaffolding.
[Abstract] The reported 87.5% accuracy and per-class ceiling performance are given without data-split details, prompt-variation ablations, or error analysis stratified by printer/material; these omissions make it impossible to assess robustness to model updates or prompt phrasing, which the load-bearing premise above identifies as critical.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. Both major comments correctly identify that the abstract (and, by extension, the current manuscript) does not supply the concrete artifacts and experimental controls needed to isolate the contribution of role decomposition. We will address these omissions with targeted additions rather than by altering the core claims.

Referee: [Abstract] The central claim that 'structured decomposition, rather than backbone strength alone, is the dominant source of improvement' cannot be evaluated because neither the exact single-LLM baseline prompt template nor the JSON schemas and deterministic deviation table produced by the three-role pipeline are supplied. Without these artifacts it is impossible to distinguish architectural gain from richer multi-turn scaffolding.

Authors: We agree. In the revised manuscript we will add an appendix that contains (i) the exact single-LLM baseline prompt template used for the 59.5% result, (ii) the JSON schema emitted by the Extractor-LLM, (iii) the Reference-LLM output format, and (iv) the deterministic deviation table passed to the Judge-LLM. With these artifacts, readers can reproduce the multi-turn scaffolding and verify that the performance gap is not merely an artifact of prompt length or formatting.
Revision: yes

Referee: [Abstract] The reported 87.5% accuracy and per-class ceiling performance are given without data-split details, prompt-variation ablations, or error analysis stratified by printer/material; these omissions make it impossible to assess robustness to model updates or prompt phrasing, which the load-bearing premise above identifies as critical.

Authors: We accept the criticism. The revision will include: (a) explicit train/test split ratios and randomization seed for the N=200 corpus, (b) a prompt-variation ablation table (temperature, few-shot count, and schema phrasing), and (c) a confusion-matrix breakdown stratified by printer family and material. These additions will be placed in the experimental section and will directly support the robustness statements currently only asserted in the abstract.
Revision: yes
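The stratified breakdown promised in (c) could be produced along the following lines. This is a minimal sketch, not the authors' analysis code: the record layout and the `printer_A` / `material_X` labels are hypothetical placeholders, and only the five class names come from the abstract.

```python
from collections import Counter, defaultdict

CLASSES = ["non-defective", "under-extrusion", "over-extrusion",
           "warping", "stringing"]


def stratified_confusion(records):
    """Group (truth, prediction) counts by (printer, material) stratum.

    records: iterable of (printer, material, true_label, predicted_label).
    """
    tables = defaultdict(Counter)
    for printer, material, truth, pred in records:
        tables[(printer, material)][(truth, pred)] += 1
    return tables


def accuracy(table: Counter) -> float:
    """Fraction of samples on the diagonal of one stratum's confusion table."""
    total = sum(table.values())
    correct = sum(n for (truth, pred), n in table.items() if truth == pred)
    return correct / total if total else float("nan")


# Hypothetical per-file results; real ones would come from the evaluation run.
records = [
    ("printer_A", "material_X", "non-defective", "non-defective"),
    ("printer_A", "material_X", "non-defective", "warping"),
    ("printer_A", "material_X", "stringing", "stringing"),
    ("printer_B", "material_Y", "warping", "warping"),
]

tables = stratified_confusion(records)
for stratum, table in sorted(tables.items()):
    print(stratum, f"accuracy={accuracy(table):.3f}")
    for truth in CLASSES:
        # One row per true class, one column per predicted class.
        print(f"  {truth:16s}", [table[(truth, pred)] for pred in CLASSES])
```

A layout like this would show directly whether the conservative false alarms on non-defective files cluster in one printer or material.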

Circularity Check

0 steps flagged

No circularity: empirical accuracy measured on held-out corpus

The abstract reports a direct empirical comparison (87.5% vs. 59.5%) between the three-role pipeline and a single-LLM baseline on the same N=200 held-out G-code files. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked; the claimed improvement is therefore not reducible to any input by construction. The evaluation remains falsifiable by re-running the identical prompts and schemas on the released corpus.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes that G-code can be losslessly mapped to a fixed process-parameter schema, that printer and material documentation contain complete and unambiguous operating ranges, and that deviation tables plus G-code excerpts are sufficient evidence for the Judge-LLM. No free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: G-code text plus documentation suffice for deterministic extraction of all relevant process parameters and ranges.
Invoked by the definitions of the Extractor-LLM and Reference-LLM roles.

pith-pipeline@v0.9.0 · 5625 in / 1314 out tokens · 31570 ms · 2026-05-07T02:25:28.613886+00:00 · methodology
