arxiv: 2605.12516 · v1 · pith:YFWTNFBRnew · submitted 2026-04-02 · 💻 cs.CL · cs.AI

Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

Pith reviewed 2026-05-14 21:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords retrieval-augmented generation · domain adaptation · additive manufacturing · large language models · fine-tuning · polymer composites · engineering question answering

The pith

Retrieval-augmented generation adapts large language models to additive manufacturing questions far better than fine-tuning on raw domain text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to make general-purpose LLMs produce reliable answers in the polymer-composite additive manufacturing domain, where knowledge sits in scattered papers, standards, and manuals. It builds a curated corpus and tests three versions of LLaMA-3-8B: the untouched baseline, a RAG system that pulls relevant chunks at query time, and a model fine-tuned directly on the raw AM text. Mechanical engineering experts then judge the outputs on 200 targeted questions for accuracy, relevance, and overall preference. The RAG version wins on the large majority of questions while fine-tuning lowers performance relative to the baseline.

Core claim

A RAG system built on LLaMA-3-8B that retrieves relevant chunks from a vector database of curated additive manufacturing documents produces responses judged more accurate in 75.5 percent of cases, more relevant in 90.8 percent of cases, and preferred overall in 85.2 percent of cases compared with the pretrained baseline. Fine-tuning the same model on the raw domain text, by contrast, yields answers judged more accurate in only 5.6 percent of cases and more relevant in only 32.5 percent of cases.

What carries the argument

Retrieval-augmented generation that fetches relevant document chunks from a vector database of the AM corpus to condition each LLM response at inference time.
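The retrieve-then-condition step can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the bag-of-words `embed` function stands in for a learned sentence encoder, the in-memory list stands in for a vector database, and the corpus chunks, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase token counts (a stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Prepend the retrieved chunks so the frozen model answers from them."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical corpus chunks, invented purely for illustration.
CHUNKS = [
    "Nozzle temperature affects interlayer bonding in carbon fiber reinforced filament.",
    "Annual maintenance schedules for the printer enclosure fans.",
    "Fiber orientation in printed composites follows the deposition path.",
]

if __name__ == "__main__":
    print(build_prompt("How does nozzle temperature affect bonding?", CHUNKS, k=1))
```

The base model's weights never change in this pattern; only the prompt does, which is why new documents can be added by re-indexing rather than retraining.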

If this is right

  • Retrieval from a curated corpus supplies domain grounding without altering the base model's weights and therefore without the performance drop seen in naive fine-tuning.
  • For engineering fields whose knowledge lives in heterogeneous documents, keeping the foundation model frozen and adding external retrieval yields higher accuracy and relevance than direct parameter updates on raw text.
  • Expert preference ratings on targeted questions offer a practical yardstick for judging whether a domain-adapted LLM is actually usable by specialists.
  • Unstructured fine-tuning on technical documents risks degrading the model's ability to answer domain questions unless paired with cleaning, instruction tuning, or other safeguards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-first pattern could be applied to other technical domains such as aerospace certification or materials selection where knowledge is similarly fragmented across standards and reports.
  • Keeping the base model frozen while augmenting it externally may prove more scalable than repeated fine-tuning rounds when new technical documents appear.
  • Combining modest instruction tuning with the RAG pipeline could be tested next to see whether the accuracy gains compound.

Load-bearing premise

The 200 expert-designed questions together with the mechanical engineering experts' judgments give an unbiased and representative measure of answer quality in real polymer-composite additive manufacturing work.

What would settle it

A new test set of 200 questions drawn from actual production-floor logs or customer support tickets in polymer-composite AM, scored by the same expert judges, in which the fine-tuned model produces more accurate answers than the RAG model on a majority of items.

Original abstract

General-purpose large language models (LLMs) often struggle to generate reliable responses in specialized engineering domains due to limited domain grounding and insufficient exposure to structured technical knowledge. This study investigates practical strategies for adapting a foundation LLM to the additive manufacturing (AM) domain in order to improve answer accuracy, relevance, and usability for expert-level question answering. AM knowledge is distributed across heterogeneous sources such as academic literature, manufacturer documentation, technical standards, and procedural guides. Although general LLMs demonstrate strong linguistic capabilities, they frequently fail to retrieve and contextualize such domain-specific information. Two common approaches to address this limitation are domain-specific fine-tuning and retrieval-augmented generation (RAG). We construct a curated AM corpus and evaluate three configurations based on LLaMA-3-8B: (1) the pretrained baseline model, (2) a RAG system that retrieves relevant document chunks from a vector database, and (3) a model fine-tuned on raw domain text. Performance is evaluated using 200 expert-designed AM questions assessed by mechanical engineering experts for accuracy, relevance, and overall preference. Results show that the RAG model consistently outperforms the baseline. Among the 200 questions, 75.5% of RAG responses are judged more accurate, 85.2% are preferred overall, and 90.8% are rated more relevant than baseline responses. In contrast, fine-tuning on raw AM text reduces performance, producing more accurate answers in only 5.6% of cases and more relevant answers in 32.5% of cases. These results indicate that retrieval-augmented approaches provide a more effective pathway for adapting LLMs to specialized engineering domains than naive fine-tuning on unstructured technical data.
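To gauge how much sampling noise the headline percentages carry, a Wilson score interval can be computed for each one. The sketch below assumes each percentage is a simple proportion over all 200 questions with one independent judgment per question; the abstract does not state how ties or multiple raters were handled, so the sample size `N` and the resulting intervals are illustrative only.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

N = 200  # assumption: every percentage is computed over all 200 questions

# Proportions as reported in the abstract, each relative to the baseline.
REPORTED = {
    "RAG more accurate": 0.755,
    "RAG preferred overall": 0.852,
    "RAG more relevant": 0.908,
    "fine-tuned more accurate": 0.056,
    "fine-tuned more relevant": 0.325,
}

if __name__ == "__main__":
    for label, p in REPORTED.items():
        low, high = wilson_interval(p, N)
        print(f"{label}: {p:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Under these assumptions the intervals for the RAG comparisons stay well above 50 percent, so the direction of the result does not hinge on sampling noise alone; whether it hinges on the evaluation protocol is a separate question.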

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates domain adaptation strategies for LLaMA-3-8B in the additive manufacturing (AM) domain. It compares a pretrained baseline against a RAG system retrieving from a curated AM corpus and a model fine-tuned on raw domain text. Using 200 expert-designed questions assessed by mechanical engineering experts, it reports that RAG responses are judged more accurate in 75.5% of cases, preferred overall in 85.2%, and more relevant in 90.8% compared to baseline, while fine-tuning underperforms (more accurate in only 5.6% of cases).

Significance. If the evaluation holds, the work provides concrete evidence that retrieval-augmented approaches outperform naive fine-tuning for grounding LLMs in heterogeneous technical domains such as AM, where knowledge spans literature, standards, and documentation. This has practical value for engineering applications and highlights risks of direct fine-tuning on unstructured text.

major comments (2)
  1. [Abstract and Evaluation section] The central claims rest on expert judgments of the 200 questions, yet no information is given on blinding (whether judges knew response sources), inter-rater agreement (e.g., percentage agreement or Cohen's kappa), or the protocol for question design and selection. Without these, the reported 75.5% accuracy preference cannot be confidently attributed to model differences rather than evaluation artifacts.
  2. [Methods section on corpus and fine-tuning] Corpus construction, chunking strategy for the vector database, and fine-tuning hyperparameters (learning rate, epochs, batch size, data cleaning steps) are not specified. This makes it impossible to determine whether the reported underperformance of fine-tuning (5.6% more accurate) reflects a general limitation of the approach or specific implementation choices such as lack of instruction tuning.
minor comments (2)
  1. [Title and Abstract] Title mentions polymer-composite AM but the abstract and results treat AM more broadly; clarify the exact scope of the corpus and whether results generalize beyond polymer composites.
  2. [Abstract] The abstract states performance is evaluated on accuracy, relevance, and overall preference but does not define the exact rating scales or aggregation method used by the experts.
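The inter-rater statistic named in major comment 1 is straightforward to compute once per-question labels from two judges are available. The sketch below implements Cohen's kappa for two raters; the label names and the six example judgments are hypothetical, since the paper's raw ratings are not available here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled at random with their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Hypothetical judgments: which response each expert preferred per question.
    expert_1 = ["rag", "rag", "base", "rag", "tie", "base"]
    expert_2 = ["rag", "base", "base", "rag", "tie", "rag"]
    print(f"kappa = {cohens_kappa(expert_1, expert_2):.3f}")
```

Raw percentage agreement alone can look high when one label dominates, as it would if RAG is preferred on most questions; kappa discounts that by subtracting the agreement expected from label frequencies alone.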

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key gaps in methodological transparency. We address each point below and will revise the manuscript to incorporate the requested details, thereby strengthening the reproducibility and credibility of the reported results.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central claims rest on expert judgments of the 200 questions, yet no information is given on blinding (whether judges knew response sources), inter-rater agreement (e.g., percentage agreement or Cohen's kappa), or the protocol for question design and selection. Without these, the reported 75.5% accuracy preference cannot be confidently attributed to model differences rather than evaluation artifacts.

    Authors: We agree that these details are necessary to support the validity of the expert evaluation. The current manuscript does not include them in the Evaluation section. In the revised version we will add a dedicated subsection that describes: the blinding protocol (judges received anonymized responses with no indication of model origin and were not informed of the study design), inter-rater agreement statistics computed from the expert ratings, and the question design protocol (iterative development by three mechanical engineering experts to ensure coverage of representative AM topics). These additions will allow readers to assess potential evaluation artifacts directly.
    revision: yes

  2. Referee: [Methods section on corpus and fine-tuning] Corpus construction, chunking strategy for the vector database, and fine-tuning hyperparameters (learning rate, epochs, batch size, data cleaning steps) are not specified. This makes it impossible to determine whether the reported underperformance of fine-tuning (5.6% more accurate) reflects a general limitation of the approach or specific implementation choices such as lack of instruction tuning.

    Authors: We concur that the Methods section lacks the necessary implementation details. The revised manuscript will expand this section with explicit descriptions of: the corpus construction process (sources, curation criteria, and total size), the chunking strategy used to populate the vector database (including chunk size and overlap), and the complete fine-tuning configuration (optimizer, learning rate, epochs, batch size, LoRA parameters if used, and data cleaning steps). We will also note that instruction tuning was not applied, allowing readers to evaluate whether the observed fine-tuning results are attributable to this design choice or to the general approach.
    revision: yes
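The two chunking parameters the rebuttal promises to report, chunk size and overlap, interact as in the sketch below. It splits on whitespace and counts words; the paper's actual tokenization and values are not stated in this record, so the defaults of 200 and 50 are placeholders rather than the authors' settings.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks

if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a larger index. Reporting both values matters because they change which passages a query can match.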

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on external expert judgments

Full rationale

The paper reports direct empirical results from three model configurations (baseline LLaMA-3-8B, RAG, and fine-tuned) evaluated on 200 expert-designed questions by mechanical engineering experts for accuracy, relevance, and preference. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the methodology or results. All quantitative claims (75.5% more accurate, etc.) flow from independent human judgments rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that expert human judgments constitute a reliable proxy for answer quality in specialized engineering tasks and that the curated AM corpus adequately represents the knowledge needed for the 200 test questions.

axioms (1)
  • domain assumption: Expert judgments by mechanical engineers provide an unbiased and sufficient measure of accuracy, relevance, and preference for LLM answers in the AM domain.
    The evaluation protocol depends entirely on these judgments without additional objective metrics or inter-rater statistics reported in the abstract.

pith-pipeline@v0.9.0 · 5632 in / 1346 out tokens · 43973 ms · 2026-05-14T21:49:37.779237+00:00 · methodology
