Document Retrieval Augmented Fine-Tuning (DRAFT) for safety-critical software assessments

Dan Basher; Gary Bamford; Howard Parkinson; Mohammadreza Sheikhfathollahi; Regan Bolton; Simon Parkinson; Vanessa Vulovic

arxiv: 2505.01307 · v1 · submitted 2025-05-02 · 💻 cs.SE · cs.AI

Regan Bolton, Mohammadreza Sheikhfathollahi, Simon Parkinson, Vanessa Vulovic, Gary Bamford, Dan Basher, Howard Parkinson

Pith reviewed 2026-05-22 17:06 UTC · model grok-4.3

keywords: document retrieval, fine-tuning, safety-critical software, regulatory compliance, large language models, retrieval-augmented generation, compliance assessment, software safety

The pith

A fine-tuning approach called DRAFT improves large language model accuracy on safety-critical software compliance checks by seven percent when paired with dual retrieval of documentation and standards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Document Retrieval-Augmented Fine-Tuning to help large language models evaluate software against regulatory requirements more reliably than standard models or basic retrieval methods allow. It pairs a retrieval system that pulls both from the software's own documents and from applicable standards with a fine-tuning stage trained on datasets that mix relevant passages with realistic distractors. This setup is tested on GPT-4o-mini and produces a measured seven percent rise in correct answers plus clearer use of evidence and more appropriate domain reasoning. The result matters because manual regulatory reviews are slow and error-prone, while purely generative models often lack traceable grounding in the required sources.

Core claim

DRAFT extends retrieval-augmented generation by adding a fine-tuning framework built around a dual-retrieval architecture that simultaneously fetches software documentation and reference standards. Training data are produced through a semi-automated process that varies the number of relevant documents and inserts meaningful distractors to match real assessment conditions. When applied to GPT-4o-mini the method yields a seven percent improvement in correctness over the untuned baseline together with gains in evidence handling, response structure, and domain-specific reasoning while preserving the transparency needed for regulatory use.

What carries the argument

The dual-retrieval architecture that simultaneously accesses software documentation and applicable reference standards, embedded inside a fine-tuning process that trains on datasets containing both relevant and distracting documents.
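
The summary above does not say how the two retrieval paths are implemented, so the following is only a minimal sketch of the general idea: query two separate corpora and keep the results labelled by source so that the answer can cite each one. The `Passage` type, the word-overlap scorer, and the function names are all hypothetical stand-ins, not the authors' code; a real system would most likely use an embedding-based retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str   # "documentation" or "standard" (illustrative labels)
    doc_id: str
    text: str


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, corpus: list[Passage], k: int) -> list[Passage]:
    """Return the k highest-scoring passages from a single corpus."""
    ranked = sorted(corpus, key=lambda p: overlap_score(query, p.text), reverse=True)
    return ranked[:k]


def dual_retrieve(query: str, documentation: list[Passage],
                  standards: list[Passage], k_docs: int = 2,
                  k_std: int = 2) -> dict[str, list[Passage]]:
    """Query both corpora independently and keep results grouped by source."""
    return {
        "documentation": retrieve(query, documentation, k_docs),
        "standard": retrieve(query, standards, k_std),
    }


def build_prompt(query: str, retrieved: dict[str, list[Passage]]) -> str:
    """Lay out the question and the source-labelled evidence for the model."""
    lines = [f"Question: {query}", ""]
    for source, passages in retrieved.items():
        lines.append(f"[{source}]")
        for p in passages:
            lines.append(f"- ({p.doc_id}) {p.text}")
        lines.append("")
    return "\n".join(lines).rstrip()
```

Keeping the two result sets separate, rather than merging them into one ranked list, is a design choice made here for traceability: each piece of evidence in the prompt stays attributable to either the product documentation or the standard.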

If this is right

Correctness on compliance questions rises by seven percent relative to the baseline model.
Model outputs show improved evidence handling, more coherent response structure, and stronger domain-specific reasoning.
The approach maintains the transparency and traceable justification required in regulatory domains.
Compliance assessment systems gain a practical route to handling complex regulatory frameworks at larger scale than manual review permits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-retrieval plus fine-tuning pattern could transfer to other regulated domains such as medical device or aviation safety reviews whenever both product documentation and external standards are available.
Training with realistic distractors may reduce the model's tendency to over-rely on superficial matches, an effect that could be measured by tracking how often irrelevant passages are cited in answers.
Replacing the semi-automated dataset step with fully human-curated examples would test whether the reported gains depend on the quality of the distractor selection process.
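
The distractor-citation measurement suggested above could be sketched as follows. This assumes, purely for illustration, that answers cite passages with bracketed identifiers such as `[S1]` and that the set of distractor identifiers is known for each evaluation run; neither convention comes from the paper.

```python
import re

# Hypothetical citation convention: identifiers in square brackets, e.g. [S1].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def distractor_citation_rate(answers: list[str], distractor_ids: set[str]) -> float:
    """Fraction of all citations across the answers that point at a distractor."""
    cited = [doc_id for answer in answers for doc_id in CITATION.findall(answer)]
    if not cited:
        return 0.0  # no citations at all, so nothing was mis-cited
    return sum(1 for doc_id in cited if doc_id in distractor_ids) / len(cited)
```

Comparing this rate for the baseline and the fine-tuned model on the same questions would give a direct, if coarse, signal of whether distractor-aware training changes citation behaviour.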

Load-bearing premise

The semi-automated dataset generation creates training examples with variable numbers of relevant documents and meaningful distractors that closely match the distribution of real-world regulatory assessment tasks.
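
To make the premise concrete, here is a rough sketch of how one training instance with a variable number of relevant ("golden") documents and a fill of distractors might be assembled. The constraint of at least one golden document per instance follows the text reproduced with Figure 2; the sampling scheme, the fixed `total_docs` budget, and every name in the code are assumptions made for illustration, and the paper's semi-automated process may differ substantially.

```python
import random


def build_training_instance(question, answer, golden_docs, distractor_pool,
                            total_docs=4, rng=None):
    """Mix m >= 1 golden documents with distractors, then shuffle the context."""
    rng = rng or random.Random()
    if not golden_docs:
        raise ValueError("at least one golden document is required")
    # Vary the number of golden documents between 1 and the available maximum.
    m = rng.randint(1, min(len(golden_docs), total_docs))
    chosen_golden = rng.sample(golden_docs, m)
    # Fill the remaining slots with distractors (fewer if the pool is small).
    n_distractors = min(total_docs - m, len(distractor_pool))
    chosen_distractors = rng.sample(distractor_pool, n_distractors)
    context = chosen_golden + chosen_distractors
    rng.shuffle(context)  # avoid a positional cue for which documents are golden
    return {
        "question": question,
        "context": context,
        "answer": answer,
        "num_golden": m,
    }
```

Whether distractors drawn this way are "meaningful" depends entirely on how `distractor_pool` is built, which is exactly the part of the pipeline the premise asks readers to take on trust.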

What would settle it

Evaluating the fine-tuned model against a set of previously unseen safety-critical software assessments that carry independently verified correct compliance judgments and finding no statistically significant accuracy gain over the baseline model would falsify the reported improvement.
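
The paper's summary does not name a significance test. One common choice when both models answer the same questions is McNemar's exact test on the discordant pairs, sketched below using only the standard library. The function and argument names are hypothetical, and the inputs are assumed to be per-question correctness flags for the baseline and fine-tuned models in the same order.

```python
from math import comb


def mcnemar_exact(baseline_correct, tuned_correct):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value) where b counts questions only the baseline got
    right and c counts questions only the fine-tuned model got right.
    """
    if len(baseline_correct) != len(tuned_correct):
        raise ValueError("both models must be scored on the same questions")
    pairs = list(zip(baseline_correct, tuned_correct))
    b = sum(1 for base, tuned in pairs if base and not tuned)
    c = sum(1 for base, tuned in pairs if not base and tuned)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree, so there is no evidence
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A p-value above the chosen threshold on an independently verified test set would correspond to the "no statistically significant accuracy gain" outcome described above.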

Figures

Figures reproduced from arXiv: 2505.01307 by Dan Basher, Gary Bamford, Howard Parkinson, Mohammadreza Sheikhfathollahi, Regan Bolton, Simon Parkinson, Vanessa Vulovic.

**Figure 2.** Prompt template used to generate an answer in the fine-tuning dataset.

… included at all P% of the time, our approach maintains at least one golden document (m ≥ 1) in every training instance. For our use case, at least one authoritative source document is typically necessary to correctly answer a question. Furthermore, the optimal “P value” was irregular and only provided marginal performance gains in the RA…

**Figure 3.** 4o-mini model: full visualization of training and validation loss across all 685 records. The blue line shows the training loss (sampled every 5 points for clarity), while the red line with markers shows validation loss measurements taken at every 10th record. Note the significant decline in both losses during the first 100 records and the stabilization after approximately record 300.

Original abstract

Safety critical software assessment requires robust assessment against complex regulatory frameworks, a process traditionally limited by manual evaluation. This paper presents Document Retrieval-Augmented Fine-Tuning (DRAFT), a novel approach that enhances the capabilities of a large language model (LLM) for safety-critical compliance assessment. DRAFT builds upon existing Retrieval-Augmented Generation (RAG) techniques by introducing a novel fine-tuning framework that accommodates our dual-retrieval architecture, which simultaneously accesses both software documentation and applicable reference standards. To fine-tune DRAFT, we develop a semi-automated dataset generation methodology that incorporates variable numbers of relevant documents with meaningful distractors, closely mirroring real-world assessment scenarios. Experiments with GPT-4o-mini demonstrate a 7% improvement in correctness over the baseline model, with qualitative improvements in evidence handling, response structure, and domain-specific reasoning. DRAFT represents a practical approach to improving compliance assessment systems while maintaining the transparency and evidence-based reasoning essential in regulatory domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DRAFT adds a dual-retrieval fine-tuning step and a distractor-aware dataset recipe for compliance assessment, but the 7% gain rests on unvalidated synthetic data.

The paper's main contribution is a fine-tuning method called DRAFT that lets an LLM retrieve from both software documentation and regulatory standards at the same time, then trains on a semi-automated dataset that mixes relevant documents with distractors. They report a 7% correctness improvement over a plain GPT-4o-mini baseline plus some qualitative gains in how the model cites evidence and reasons about domain rules. That setup is a reasonable extension of existing RAG work to a setting where missing or misreading a standard can matter a lot. The dataset construction, with its variable document counts and deliberate distractors, is the part that feels most tailored to the compliance use case and could be reusable by others facing similar regulatory text problems. Credit to the authors for focusing on evidence-based output rather than just accuracy scores.

The evaluation details are thin. The abstract gives the 7% figure but does not describe the baseline model, the size or composition of the test set, or any statistical test. More importantly, there is no reported check that the generated distractors or the distribution of document counts actually resemble the noise and cross-references found in real safety assessments. If the synthetic cases are easier or less entangled than genuine regulatory material, the measured lift could shrink or disappear on live data. That is the central risk.

Readers working on applied AI for regulated software engineering will get the most from this, especially those already experimenting with retrieval-augmented systems for standards compliance. The work is coherent on its own terms and shows honest attention to the domain constraints, so it is worth sending out for peer review. Referees will likely press on the dataset validation and the exact experimental controls, but the practical framing gives it enough substance to justify the time.

Referee Report

2 major / 1 minor

Summary. The paper proposes Document Retrieval-Augmented Fine-Tuning (DRAFT), a framework that augments LLMs with a dual-retrieval architecture accessing both software documentation and regulatory standards, combined with fine-tuning on a semi-automated dataset containing variable numbers of relevant documents and meaningful distractors. Experiments using GPT-4o-mini report a 7% improvement in correctness over a baseline model, accompanied by qualitative gains in evidence handling, response structure, and domain-specific reasoning for safety-critical compliance assessments.

Significance. If the empirical gains hold under rigorous validation, DRAFT could offer a practical, evidence-preserving method for improving automated compliance assessment in safety-critical software, an area where manual processes are costly and error-prone. The combination of retrieval and targeted fine-tuning addresses a real need for transparency in regulatory domains, and the semi-automated data generation approach is a potentially reusable contribution if its fidelity to real scenarios is demonstrated.

major comments (2)

[Abstract] Abstract and methodology description: the central claim of a 7% correctness improvement is presented without any reported details on baseline definition, number of test instances, statistical tests, confidence intervals, or error analysis. This information is load-bearing for interpreting whether the gain reflects genuine advances in evidence handling rather than experimental artifacts.
[Dataset Generation Methodology] Dataset generation section: the semi-automated methodology is described as incorporating variable relevant documents and meaningful distractors that 'closely mirror real-world assessment scenarios,' yet no quantitative validation (expert ratings, inter-annotator agreement, or side-by-side comparison against actual regulatory logs) is supplied. Without such checks, the measured improvement cannot be confidently attributed to better domain reasoning rather than properties of the synthetic distribution.

minor comments (1)

[Abstract] The abstract refers to a 'dual-retrieval architecture' without a concise diagram or pseudocode; adding one would clarify how the simultaneous access to documentation and standards is implemented during both retrieval and fine-tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the paper.

Referee: [Abstract] Abstract and methodology description: the central claim of a 7% correctness improvement is presented without any reported details on baseline definition, number of test instances, statistical tests, confidence intervals, or error analysis. This information is load-bearing for interpreting whether the gain reflects genuine advances in evidence handling rather than experimental artifacts.

Authors: We agree that the abstract would benefit from greater specificity on the evaluation protocol. In the revised manuscript we will expand the abstract to explicitly state the baseline model definition, the number of test instances, and the statistical significance of the observed improvement. We will also add a dedicated error analysis subsection in the experiments section that reports confidence intervals and breaks down failure modes by evidence-handling, distractor sensitivity, and domain-reasoning categories. revision: yes
Referee: [Dataset Generation Methodology] Dataset generation section: the semi-automated methodology is described as incorporating variable relevant documents and meaningful distractors that 'closely mirror real-world assessment scenarios,' yet no quantitative validation (expert ratings, inter-annotator agreement, or side-by-side comparison against actual regulatory logs) is supplied. Without such checks, the measured improvement cannot be confidently attributed to better domain reasoning rather than properties of the synthetic distribution.

Authors: We acknowledge that the current description relies on qualitative design choices rather than quantitative validation of the synthetic dataset. In the revision we will add a new subsection reporting expert ratings (on a 5-point Likert scale for realism and distractor quality) together with inter-annotator agreement statistics. Where feasible we will also include a small side-by-side comparison against a sample of real regulatory assessment logs to further substantiate the claim that the generated distribution mirrors operational conditions. revision: yes
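
The inter-annotator agreement promised in the second response could be reported in several ways. As one illustration, the sketch below computes unweighted Cohen's kappa for two raters over the same items. The choice of statistic is an assumption made here, not something the rebuttal specifies; for ordinal 5-point Likert ratings a weighted kappa or Krippendorff's alpha may be more appropriate.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate that the raters agree about as often as random labelling with the same marginals would.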

Circularity Check

0 steps flagged

No circularity: empirical result on generated dataset

The paper advances an empirical method (DRAFT fine-tuning on a semi-automated dataset with variable relevant documents and distractors) and reports a measured 7% correctness gain from GPT-4o-mini experiments. No derivation chain, equations, or first-principles steps exist that could reduce a prediction to its inputs by construction. The central claim rests on experimental outcomes rather than self-definitional terms, fitted parameters renamed as predictions, or load-bearing self-citations. The methodology is presented as mirroring real scenarios but is not itself derived from the result it evaluates, leaving the evaluation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on the standard assumption that LLMs can be fine-tuned to improve performance on domain-specific retrieval and reasoning tasks; it introduces the DRAFT method itself as the primary new element without listing explicit free parameters or additional invented physical entities.

axioms (1)

domain assumption: Large language models can be fine-tuned using retrieval-augmented examples to improve correctness and reasoning on regulatory compliance tasks.
This assumption underpins the decision to apply fine-tuning on top of the dual-retrieval architecture.

invented entities (1)

DRAFT framework (no independent evidence)
purpose: A fine-tuning approach that accommodates dual retrieval from software documentation and reference standards
Newly proposed method whose performance is evaluated in the experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation to the cited theorem: unclear
Paper passage: "DRAFT builds upon existing Retrieval-Augmented Generation (RAG) techniques by introducing a novel fine-tuning framework that accommodates our dual-retrieval architecture... semi-automated dataset generation methodology that incorporates variable numbers of relevant documents with meaningful distractors"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation to the cited theorem: unclear
Paper passage: "We used Low-Rank Adaptation (LoRA) for fine-tuning... Experiments with GPT-4o-mini demonstrate a 7% improvement in correctness"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
