Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

Aric Leather; Haingo Razafindranto; Jitesh Arora; Ma Toan Bach; Ranveer Sandhu; Tanvir Alam; Yuchi Zheng

arxiv: 2606.19725 · v2 · pith:ZVKCN4D6new · submitted 2026-06-18 · 💻 cs.SE · cs.AI · cs.MA

Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

Pith reviewed 2026-06-26 17:03 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.MA

keywords unit testing · large language models · firmware · OpenSIL · iterative repair · line coverage · multi-agent pipeline · test generation

The pith

LLM multi-agent workflow with iterative repair produces compilable unit tests for 73 of 76 OpenSIL firmware functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Validating changes in low-level C firmware is expensive because unit tests frequently fail to compile or link due to missing headers, unresolved symbols, and dependency mismatches. The paper introduces an automated authoring workflow that uses a large language model multi-agent pipeline to generate test scaffolds, create or reuse library-aware stubs, mocks, and fakes, and run an iterative compile-dispatch repair loop guided by build logs and line-coverage feedback. Across 76 functions under test, the workflow produced compilable tests for 73. Mean line coverage reached 73.9 percent in the configuration without line-coverage guidance or retrieval augmentation, and 98.8 percent on a 48-function subset when line-coverage guidance was added. The results matter because they indicate a path to lowering the manual debugging burden of creating unit tests under strict firmware build constraints.

Core claim

The study shows that an LLM-guided multi-agent pipeline can generate unit test scaffolds, apply library-aware doubles, and iteratively repair them via build logs and coverage feedback until the tests compile and achieve high line coverage. On 76 OpenSIL functions the pipeline produced compilable tests for 73. Without line coverage guidance mean coverage was 73.9 percent; with guidance alone on a 48-function subset it reached 98.8 percent, and 94.7 percent when combined with retrieval augmentation.

What carries the argument

The iterative compile-dispatch repair loop driven by build logs and line-coverage feedback together with library-aware creation or reuse of stubs, mocks, and fakes.
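The control flow of such a loop can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: `AttemptResult`, `repair_loop`, the callback types, and the toy callbacks are all hypothetical names, and the real pipeline drives an LLM agent and a firmware build rather than in-memory callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one build-and-run attempt (hypothetical fields). */
typedef struct {
    int compiled;          /* 1 if the test built and linked */
    int dispatched;        /* 1 if the test binary ran to completion */
    double line_coverage;  /* fraction of instrumented lines hit, 0.0 to 1.0 */
} AttemptResult;

/* Callbacks stand in for the build system and the repair agent. */
typedef AttemptResult (*BuildAndRunFn)(void *state);
typedef void (*RepairFn)(void *state, const AttemptResult *last);

/* Returns the number of repair iterations used, or -1 if the budget ran out. */
int repair_loop(void *state, BuildAndRunFn build_and_run, RepairFn repair,
                int max_iterations, double coverage_target,
                AttemptResult *final_result)
{
    assert(build_and_run != NULL && repair != NULL && final_result != NULL);

    for (int iteration = 0; iteration <= max_iterations; ++iteration) {
        *final_result = build_and_run(state);
        if (final_result->compiled && final_result->dispatched &&
            final_result->line_coverage >= coverage_target) {
            return iteration;  /* success: no further repair needed */
        }
        if (iteration < max_iterations) {
            repair(state, final_result);  /* feed the failure back for a fix */
        }
    }
    return -1;
}

/* Toy stand-ins: each repair removes one defect; the build passes at zero. */
typedef struct {
    int remaining_defects;
    double coverage_when_fixed;
} ToyState;

AttemptResult toy_build_and_run(void *state)
{
    const ToyState *toy = state;
    AttemptResult result = {0, 0, 0.0};
    if (toy->remaining_defects == 0) {
        result.compiled = 1;
        result.dispatched = 1;
        result.line_coverage = toy->coverage_when_fixed;
    }
    return result;
}

void toy_repair(void *state, const AttemptResult *last)
{
    ToyState *toy = state;
    (void)last;  /* a real repair step would inspect logs and coverage here */
    if (toy->remaining_defects > 0) {
        toy->remaining_defects--;
    }
}
```

The iteration budget matters for the paper's claim: a loop that exhausts its budget without compiling corresponds to the functions for which no compilable test was produced.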

If this is right

Automated generation-and-repair pipelines can substantially improve unit test creation efficiency in constrained firmware environments.
Line-coverage guidance alone raises mean coverage to 98.8 percent on the evaluated 48-function subset.
The approach reduces manual debugging effort compared with purely manual test authoring.
Coverage stays high when vector-database retrieval augmentation is added: 94.7 percent on the same subset, versus 98.8 percent with guidance alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loop structure could be tested on other low-level C libraries that share similar header and symbol constraints.
Extending the agents to handle more complex dependency graphs might increase the fraction of functions that succeed without any retrieval step.
If coverage feedback continues to drive repair, the method could be applied to legacy firmware where existing tests are sparse.

Load-bearing premise

The iterative repair loop driven by build logs and line-coverage feedback will converge to compilable high-coverage tests for most functions without human-written fixes.

What would settle it

Applying the workflow to additional OpenSIL functions or a different firmware codebase and finding that many tests still fail to compile after several repair iterations or require human intervention would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.19725 by Aric Leather, Haingo Razafindranto, Jitesh Arora, Ma Toan Bach, Ranveer Sandhu, Tanvir Alam, Yuchi Zheng.

**Figure 1.** Example XFER table structure. extern DF_COMMON_2_REV_XFER_BLOCK DfCmn2RevPhxXfer; extern DF_IP2IP_API DfIp2IpApiPhx; SIL_STATUS InitializeApiDfXPhx (SIL_CONTEXT *SilContext) { SIL_STATUS Status; // Set Cmn2Rev table for DF Status = SilInitCommon2RevXferTable (SilContext, SilId_DfClass, &DfCmn2RevPhxXfer); if (Status != SilPass) { return Status; } // Set Ip2Ip API for DF return SilInitIp2IpAp…

**Figure 2.** Example use of an XFER table during initialization. • JavaScript Object Notation (JSON) file. Describes the test inputs used for each test run (each iteration). • Information file (INF). Tells the EDK II build system what to compile and which packages/libraries the UT needs [4]. 2.3 Characteristics of Functions. Functions in the openSIL codebase exhibit specific characteristics that influence the approach r…

**Figure 3.** Retrieving an Ip2IpApi table and calling through a function pointer. … a one-shot mock helper from the UT support library, configures the next call to SilGetCommon2RevXferTable to return a test-controlled XFER table and status code. Functions that utilize Ip2Ip functionality: Some functions use an Ip2Ip structure containing function pointers (see …
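The one-shot mock pattern described in this excerpt can be illustrated with a small self-contained C sketch. Every identifier below (`Status`, `XferTable`, `MockXferLookupOnce`, `GetXferTable`, `FakeReadRevision`, `RevisionOrDefault`) is hypothetical and chosen for illustration; none is taken from the openSIL UT support library, whose actual signatures are not shown on this page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes, not the openSIL SIL_STATUS values. */
typedef enum { STATUS_PASS = 0, STATUS_NOT_FOUND = 1 } Status;

/* Hypothetical transfer table: a struct of function pointers. */
typedef struct {
    int (*ReadRevision)(void);
} XferTable;

/* One-shot mock state: armed for exactly one call. */
static struct {
    int armed;
    Status status;
    XferTable *table;
} g_next_call;

/* Test helper: configure what the next lookup returns. */
void MockXferLookupOnce(Status status, XferTable *table)
{
    g_next_call.armed = 1;
    g_next_call.status = status;
    g_next_call.table = table;
}

/* Test double that stands in for the real table lookup. */
Status GetXferTable(XferTable **table_out)
{
    assert(table_out != NULL);
    if (!g_next_call.armed) {
        *table_out = NULL;
        return STATUS_NOT_FOUND;  /* unconfigured calls fail visibly */
    }
    g_next_call.armed = 0;        /* consume the one-shot configuration */
    *table_out = g_next_call.table;
    return g_next_call.status;
}

/* Fake table entry returning a fixed, test-controlled value. */
int FakeReadRevision(void)
{
    return 7;
}

/* Function under test: fetches the table, then calls through the pointer. */
int RevisionOrDefault(int fallback)
{
    XferTable *table = NULL;
    if (GetXferTable(&table) != STATUS_PASS || table == NULL) {
        return fallback;
    }
    return table->ReadRevision();
}
```

Because the configuration is consumed on first use, a test can drive both the success path and the error path of the function under test without touching real hardware-facing code.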

**Figure 4.** VDB integration for retrieval: Embeddings are generated for (i) knowledge base entries, (ii) existing UTs, …

**Figure 5.** 11-stage workflow. The loop is controlled by …

**Figure 6.** C source with per-line hit/miss annotations derived from LCOV, where 1 = hit and 0 = miss.
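A minimal sketch of how such annotations could be derived is shown below. It relies on the `DA:<line>,<count>` records of the LCOV tracefile format; the function names `annotate_lines` and `line_coverage` are hypothetical, the sketch assumes a tracefile covering a single source file, and it is not the paper's tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses LCOV "DA:<line>,<count>" records from a tracefile held in memory.
 * flags must have room for max_line + 1 entries. flags[line] becomes
 * 1 (hit), 0 (miss), or stays -1 (line not instrumented).
 * Returns the number of DA records applied. */
int annotate_lines(const char *tracefile, int *flags, int max_line)
{
    assert(tracefile != NULL && flags != NULL && max_line >= 0);
    for (int i = 0; i <= max_line; ++i) {
        flags[i] = -1;
    }

    int parsed = 0;
    const char *cursor = tracefile;
    while (*cursor != '\0') {
        int line = 0;
        long count = 0;
        /* Only records starting with "DA:" match; SF:, TN:, etc. are skipped. */
        if (sscanf(cursor, "DA:%d,%ld", &line, &count) == 2 &&
            line >= 1 && line <= max_line) {
            flags[line] = (count > 0) ? 1 : 0;
            ++parsed;
        }
        const char *newline = strchr(cursor, '\n');
        if (newline == NULL) {
            break;
        }
        cursor = newline + 1;
    }
    return parsed;
}

/* Line coverage = hit lines / instrumented lines. */
double line_coverage(const int *flags, int max_line)
{
    assert(flags != NULL);
    int hit = 0;
    int instrumented = 0;
    for (int i = 1; i <= max_line; ++i) {
        if (flags[i] >= 0) {
            ++instrumented;
            hit += flags[i];
        }
    }
    return instrumented == 0 ? 0.0 : (double)hit / instrumented;
}
```

The lines flagged 0 are the ones a coverage-guided repair step would target with additional test inputs.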

**Figure 7.** Iterations, time, and cost with and without LCA.

**Figure 8.** Distribution of time, cost, and tokens for configurations with and without LCA.

**Figure 9.** Time (s), cost, and total tokens by function category for the …

**Figure 10.** Time (s), cost, and total tokens by function category for the …

Original abstract

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical engineering report on an LLM multi-agent workflow for generating unit tests in constrained firmware, with solid compilation and coverage numbers but no controls to show what the iterative repair actually adds.

The main thing to know is that the authors built a multi-agent LLM pipeline for unit tests in AMD's openSIL firmware. It handles library-aware stubs and mocks to satisfy strict C build rules, then runs an iterative loop that feeds compile logs and line coverage back to fix the tests. On 76 functions it got 73 to compile, and on a 48-function subset line coverage hit 98.8% when coverage guidance was added.

What is actually new is the domain-specific combination: library-aware double generation plus the compile-dispatch-repair loop tuned to OpenSIL constraints. The paper reports direct measurements—compilation success, repair iterations, dispatch success, coverage, plus time and token cost—so the outcomes are not circular.

The numbers look usable for practitioners who need to reduce manual test writing in similar firmware settings. The internal comparison between configurations with and without coverage guidance is a reasonable start.

The soft spot is exactly the one in the stress-test note. All headline results come from the full workflow. There is no control arm that runs the same initial prompt and library stubs in a single pass without the agents or the repair loop. That makes it impossible to separate the contribution of the iterative machinery from what a plain LLM call plus light post-editing might achieve. The abstract also gives no statistical tests or full details on how the 76 functions were chosen and filtered.

This paper is for software-engineering readers who work on test automation for low-level C codebases or who apply LLMs to constrained build environments. It is honest applied work with measurable results on a real codebase, so it deserves a serious referee even though the baseline gap will need to be addressed.

Referee Report

3 major / 2 minor

Summary. The paper presents a multi-agent LLM workflow for generating unit tests in the OpenSIL firmware codebase. It combines test scaffold generation, library-aware stub/mock creation, and an iterative compile-dispatch-repair loop using build logs and line-coverage feedback. Empirical results on 76 functions report 73 compilable tests; mean line coverage is 73.9% without guidance and reaches 98.8% with line-coverage guidance on a 48-function subset.

Significance. If the results hold, the work addresses a practical pain point in low-level firmware validation under strict build constraints and supplies concrete, reproducible metrics (compilation counts, coverage, token/cost usage) on a real AMD-maintained codebase. These strengths make the contribution potentially useful for SE practitioners working with constrained C environments.

major comments (3)

[Evaluation] Evaluation section: the 73/76 compilation success and the 98.8% coverage figures are reported only for the full multi-agent + iterative-repair workflow. No control arm applies an otherwise identical initial prompt (same model, same library stubs, same context) in a single forward pass, so the incremental contribution of the repair loop cannot be isolated.
[Evaluation] Function selection paragraph (near the start of the evaluation): the criteria used to choose the 76 functions and the precise exclusion rules are not fully specified, which limits reproducibility and makes it impossible to judge whether the reported rates generalize beyond the selected set.
[Evaluation] 48-function subset comparison: the paper reports mean coverage of 98.8% (line-coverage guidance) versus 94.7% (with retrieval) but supplies neither per-function variance, statistical tests, nor confidence intervals, weakening the claim that guidance alone produces a reliable improvement.
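The per-function statistics this comment asks for could be computed along the following lines. The sketch is illustrative only: `Summary` and `paired_difference_summary` are hypothetical names, and any inputs passed to it here would be made-up values, since the page reports no per-function coverage data.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double mean;      /* mean of the paired differences a[i] - b[i] */
    double variance;  /* unbiased sample variance (n - 1 denominator) */
} Summary;

/* Summarizes per-function paired coverage differences between two
 * configurations evaluated on the same functions. Requires n >= 2. */
Summary paired_difference_summary(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL && n >= 2);

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] - b[i];
    }
    double mean = sum / (double)n;

    double squared = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double deviation = (a[i] - b[i]) - mean;
        squared += deviation * deviation;
    }

    Summary summary = { mean, squared / (double)(n - 1) };
    return summary;
}
```

With the mean and sample variance in hand, a confidence interval for the mean difference follows the usual form, mean ± t · sqrt(variance / n), with t taken from the Student t distribution for n − 1 degrees of freedom.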

minor comments (2)

The abstract states secondary measures (time, cost, token usage) but the main text should tabulate them explicitly alongside the primary metrics for each configuration.
Clarify the exact LLM model, temperature, and maximum repair iterations in the workflow description so that the experiment is fully replicable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Referee: [Evaluation] Evaluation section: the 73/76 compilation success and the 98.8% coverage figures are reported only for the full multi-agent + iterative-repair workflow. No control arm applies an otherwise identical initial prompt (same model, same library stubs, same context) in a single forward pass, so the incremental contribution of the repair loop cannot be isolated.

Authors: We agree that a single-pass baseline would strengthen the evaluation by isolating the repair loop's contribution. The manuscript presents the results for the complete workflow, which includes the iterative repair as an essential component for achieving compilable tests under firmware constraints. In the revision, we will expand the evaluation discussion to note the typical number of repair iterations required and the common failure modes observed in initial generations, thereby providing indirect evidence for the loop's value. revision: partial
Referee: [Evaluation] Function selection paragraph (near the start of the evaluation): the criteria used to choose the 76 functions and the precise exclusion rules are not fully specified, which limits reproducibility and makes it impossible to judge whether the reported rates generalize beyond the selected set.

Authors: We will revise the function selection paragraph to fully specify the criteria used to choose the 76 functions, including the precise exclusion rules and the rationale for selection from the OpenSIL codebase. revision: yes
Referee: [Evaluation] 48-function subset comparison: the paper reports mean coverage of 98.8% (line-coverage guidance) versus 94.7% (with retrieval) but supplies neither per-function variance, statistical tests, nor confidence intervals, weakening the claim that guidance alone produces a reliable improvement.

Authors: We agree with this observation. The revised manuscript will include per-function coverage data for the 48-function subset, along with variance measures, and we will conduct and report appropriate statistical tests with confidence intervals to substantiate the comparison. revision: yes

Circularity Check

0 steps flagged

No circularity; all claims are direct empirical measurements

The manuscript presents an empirical workflow evaluation on a fixed set of 76 openSIL functions, reporting observed outcomes such as 73/76 compilation success and line-coverage percentages (73.9%, 98.8%, 94.7%) under different configurations. No equations, fitted parameters, predictions, or derivations appear in the text that reduce by construction to inputs, self-definitions, or self-citations. Results are measured quantities from running the described pipeline, with no load-bearing theoretical steps or renamings that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that current LLMs can reliably translate compiler diagnostics and coverage reports into code edits; no free parameters, new entities, or ad-hoc axioms beyond standard LLM code-generation capabilities are introduced.

axioms (1)

domain assumption: LLMs can interpret build logs and coverage reports to generate effective repairs
The iterative repair loop depends on this capability; invoked in the description of the compile-dispatch repair loop.
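To make the assumption concrete, the sketch below shows a deterministic first pass that sorts build-log lines into the failure classes the abstract names before anything is handed to a model. It assumes GCC and GNU ld message wording ("fatal error: ... No such file or directory", "undefined reference to"); other toolchains phrase these differently. The names `DiagnosticKind`, `classify_diagnostic`, and `repair_hint` are hypothetical and not part of the described pipeline.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    DIAG_MISSING_HEADER,
    DIAG_UNRESOLVED_SYMBOL,
    DIAG_OTHER
} DiagnosticKind;

/* Classifies one build-log line by substring match on GCC/ld-style wording. */
DiagnosticKind classify_diagnostic(const char *log_line)
{
    assert(log_line != NULL);
    if (strstr(log_line, "fatal error") != NULL &&
        strstr(log_line, "No such file or directory") != NULL) {
        return DIAG_MISSING_HEADER;
    }
    if (strstr(log_line, "undefined reference to") != NULL) {
        return DIAG_UNRESOLVED_SYMBOL;
    }
    return DIAG_OTHER;
}

/* Maps a diagnostic class to a coarse repair direction. */
const char *repair_hint(DiagnosticKind kind)
{
    switch (kind) {
    case DIAG_MISSING_HEADER:
        return "add the include path or provide the missing header";
    case DIAG_UNRESOLVED_SYMBOL:
        return "add or reuse a stub, mock, or fake for the symbol";
    default:
        return "pass the raw log line to the repair step";
    }
}
```

A classifier like this cannot replace the model's interpretation of the log, but it shows which parts of the assumption are mechanical and which depend on the LLM.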

pith-pipeline@v0.9.1-grok · 5800 in / 1334 out tokens · 37936 ms · 2026-06-26T17:03:59.408406+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 1 canonical work pages

[1] Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, and Junjie Chen. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2024. arXiv:2406.18181.

[2] Ani Rahmani, Joe Lian Min, and Asri Maspupah. An evaluation of code coverage adequacy in automatic testing using control flow graph visualization. In Proceedings of the 2020 IEEE 10th Symposium on Computer Applications and Industrial Electronics (ISCAIE), pages 239–244. IEEE, 2020.

[3] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(5):507–525, 2015.

[4] Tianocore. EDK II build system, 2025. Accessed: 2025-06-14.

[5] Michael Feathers. Working Effectively with Legacy Code. Addison-Wesley Professional, 2004.

[6] Michael Schäfer, Sarah Nadi, Armin Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1):85–105, January 2024.

[7] Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. ASTER: Natural and multi-language unit test generation with LLMs. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 413–424, 2025.

[8] Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving ChatGPT for unit test generation. In Proceedings of the ACM on Software Engineering (ESEC/FSE), 2024.

[9] Linux Test Project. LCOV: Code coverage report generator, 2025. Accessed: 2025.

[10] Matt Staats, Michael W. Whalen, and Mats P. E. Heimdahl. Programs, tests, and oracles: The foundations of testing revisited. In Proceedings of the 33rd International Conference on Software Engineering (ICSE), pages 391–400. ACM, 2011.

[11] GNU Binutils. ld: The GNU linker (Options: --wrap), 2025. Accessed: 2025.

[12] Ceedling. Ceedling: BDD-style unit testing for C, 2025. Accessed: 2025.

[13] Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 209–224. USENIX, 2008.

[14] Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649–678, 2011.

[15] Gordon Fraser and Andrea Arcuri. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 416–419. ACM, 2011.

[16] Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. Leveraging large language models for enhancing the understandability of generated unit tests. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE), 2025.

[17] Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Desmarais. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468, July 2024.

[18] Rabimba Karanjai, Aftab Hussain, Md Rafiqul Islam Rabin, Lei Xu, Weidong Shi, and Mohammad Amin Alipour. Harnessing the power of LLMs: Automating unit test generation for high-performance computing, 2024.

[19] Quanjun Zhang, Chunrong Fang, Siqi Gu, Ye Shang, Zhenyu Chen, and Liang Xiao. Large language models for unit testing: A systematic literature review, 2025.

[20] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, 50(4):911–936, April 2024.

[21] Shervin Minaee, Tomás Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2024.

[22] OpenAI. Introducing OpenAI o3 and o4-mini, 2025. Accessed: 2025-06-26.

[23] Norman E. Fenton and James Bieman. Software Metrics: A Rigorous and Practical Approach. CRC Press, 3rd edition, 2014.

[24] Yuwei Zhang, Qingyuan Lu, Kai Liu, Wensheng Dou, Jiaxin Zhu, Li Qian, Chunxi Zhang, Zheng Lin, and Jun Wei. CITYWALK: Enhancing LLM-based C++ unit test generation via project-dependency awareness and language-specific knowledge, 2025.

[25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020.

[26] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation, 2023.

[27] Siyuan Zhang, Ying Ding, Shuaijun Lian, Shuai Song, and Hao Li. CodeRAG: Finding relevant and necessary knowledge for retrieval-augmented repository-level code completion. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

[28] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Sanjit Sen. CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), pages 919–931. IEEE, 2023.

[29] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. Automatically finding patches using genetic programming. In Proceedings of the 31st International Conference on Software Engineering (ICSE), pages 364–374. IEEE, 2009.

[30] Martin Monperrus. Automatic software repair: A bibliography, 2018.

[31] Valentin J. M. Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J. Schwartz, and Maverick Woo. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering, 47(11):2312–2331, 2021.

[32] Aider. Aider LLM leaderboards, 2025. Accessed: 2025-09-06.
