
arXiv: 2606.19725 · v2 · pith:ZVKCN4D6new · submitted 2026-06-18 · 💻 cs.SE · cs.AI · cs.MA

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

Pith reviewed 2026-06-26 17:03 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.MA
keywords unit testing · large language models · firmware · OpenSIL · iterative repair · line coverage · multi-agent pipeline · test generation

The pith

An LLM multi-agent workflow with iterative repair produces compilable unit tests for 73 of 76 openSIL firmware functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Validating changes in low-level C firmware is expensive because unit tests frequently fail to compile or link due to missing headers, unresolved symbols, and dependency mismatches. The paper introduces an automated authoring workflow in which a large language model multi-agent pipeline generates test scaffolds, creates or reuses library-aware stubs, mocks, and fakes, and runs an iterative compile-dispatch repair loop guided by build logs and line-coverage feedback. Across 76 functions under test, the workflow produced compilable tests for 73. Mean line coverage reached 73.9 percent without line-coverage guidance or retrieval augmentation, and 98.8 percent on a 48-function subset when line-coverage guidance was added. The results indicate a path to lowering the manual debugging burden of writing unit tests under strict firmware build constraints.

Core claim

The study shows that an LLM-guided multi-agent pipeline can generate unit test scaffolds, apply library-aware doubles, and iteratively repair the tests using build logs and coverage feedback until they compile and reach high line coverage. On 76 openSIL functions, the pipeline produced compilable tests for 73. Without line-coverage guidance or retrieval augmentation, mean line coverage was 73.9 percent. On a 48-function subset, it reached 98.8 percent with line-coverage guidance alone and 94.7 percent when guidance was combined with vector-database retrieval.

What carries the argument

The iterative compile-dispatch repair loop driven by build logs and line-coverage feedback together with library-aware creation or reuse of stubs, mocks, and fakes.
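
The control flow of such a loop can be sketched in a few lines of Python. This is a minimal illustration only: `repair_loop`, `BuildResult`, `toy_compile`, `toy_repair`, and the header name `ExampleLib.h` are hypothetical names invented here, the compiler and the LLM are replaced by toy stand-ins, and the paper's actual agents, prompts, and stopping rule are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BuildResult:
    ok: bool   # True when the test compiled and linked
    log: str   # build log text fed back to the repair step


def repair_loop(
    source: str,
    compile_test: Callable[[str], BuildResult],
    propose_repair: Callable[[str, str], str],
    max_iterations: int = 5,
) -> tuple[str, bool, int]:
    """Compile; on failure request a repaired source, up to a fixed budget.

    Returns (final_source, compiled, repairs_used).
    """
    for repairs_used in range(max_iterations + 1):
        result = compile_test(source)
        if result.ok:
            return source, True, repairs_used
        if repairs_used == max_iterations:
            break  # budget exhausted, give up without another repair
        source = propose_repair(source, result.log)
    return source, False, max_iterations


# Toy stand-ins (assumptions): the "compiler" only checks for one include,
# and the "repair" step adds the header named at the end of the log line.
def toy_compile(source: str) -> BuildResult:
    if "#include <ExampleLib.h>" not in source:
        return BuildResult(False, "error: missing header ExampleLib.h")
    return BuildResult(True, "")


def toy_repair(source: str, log: str) -> str:
    if "missing header" in log:
        header = log.rsplit(" ", 1)[-1]
        return f"#include <{header}>\n{source}"
    return source


final_source, compiled, repairs = repair_loop(
    "void TestBody(void) {}", toy_compile, toy_repair
)
print(compiled, repairs)
```

In this sketch, calling `repair_loop` with `max_iterations=0` performs a single compile with no repair, which is one way a single-pass comparison could be expressed.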

If this is right

  • Automated generation-and-repair pipelines can substantially improve unit test creation efficiency in constrained firmware environments.
  • Line-coverage guidance alone raises mean coverage to 98.8 percent on the evaluated 48-function subset.
  • The approach reduces manual debugging effort compared with purely manual test authoring.
  • Coverage stays high when vector-database retrieval is added (94.7 percent), although it is lower than with line-coverage guidance alone (98.8 percent).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same loop structure could be tested on other low-level C libraries that share similar header and symbol constraints.
  • Extending the agents to handle more complex dependency graphs might increase the fraction of functions that succeed without any retrieval step.
  • If coverage feedback continues to drive repair, the method could be applied to legacy firmware where existing tests are sparse.

Load-bearing premise

The iterative repair loop driven by build logs and line-coverage feedback will converge to compilable high-coverage tests for most functions without human-written fixes.

What would settle it

The central claim would be falsified if the workflow, applied to additional openSIL functions or to a different firmware codebase, left many tests failing to compile after several repair iterations or requiring human intervention.

Figures

Figures reproduced from arXiv: 2606.19725 by Aric Leather, Haingo Razafindranto, Jitesh Arora, Ma Toan Bach, Ranveer Sandhu, Tanvir Alam, Yuchi Zheng.

Figure 1. Example XFER table structure.
    extern DF_COMMON_2_REV_XFER_BLOCK DfCmn2RevPhxXfer;
    extern DF_IP2IP_API DfIp2IpApiPhx;

    SIL_STATUS InitializeApiDfXPhx (SIL_CONTEXT *SilContext) {
      SIL_STATUS Status;
      // Set Cmn2Rev table for DF
      Status = SilInitCommon2RevXferTable (SilContext, SilId_DfClass, &DfCmn2RevPhxXfer);
      if (Status != SilPass) {
        return Status;
      }
      // Set Ip2Ip API for DF
      return SilInitIp2IpAp…
Figure 2. Example use of an XFER table during initialization.
  • JavaScript Object Notation (JSON) file. Describes the test inputs used for each test run (each iteration).
  • Information file (INF). Tells the EDK II build system what to compile and which packages/libraries the UT needs [4].
2.3 Characteristics of Functions. Functions in the openSIL codebase exhibit specific characteristics that influence the approach r…
Figure 3. Retrieving an Ip2IpApi table and calling through a function pointer.
… a one-shot mock helper from the UT support library, configures the next call to SilGetCommon2RevXferTable to return a test-controlled XFER table and status code.
Functions that utilize Ip2Ip functionality: Some functions use an Ip2Ip structure containing function pointers (see …
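
The one-shot mock idea mentioned in this excerpt can be illustrated in isolation. The sketch below is written in Python purely to show the concept (the next call returns a test-controlled value, later calls fall back to a default); it is not the openSIL UT support library's API, which is C and is not shown on this page. `OneShotMock`, `will_return_once`, `default_lookup`, and the status string `StatusNotConfigured` are hypothetical names; `SilPass` and `SilId_DfClass` appear in the Figure 1 excerpt and are used here only as plain strings.

```python
class OneShotMock:
    """Callable that returns queued values once each, then uses a fallback."""

    def __init__(self, fallback):
        self._fallback = fallback
        self._queued = []
        self.calls = []  # record of argument tuples, useful for assertions

    def will_return_once(self, value):
        # Configure the result of the next call only.
        self._queued.append(value)

    def __call__(self, *args):
        self.calls.append(args)
        if self._queued:
            return self._queued.pop(0)
        return self._fallback(*args)


def default_lookup(class_id):
    # Hypothetical default behaviour when no result has been configured.
    return ("StatusNotConfigured", None)


get_xfer_table = OneShotMock(default_lookup)
fake_table = {"revision": 1}  # stand-in for a test-controlled XFER table
get_xfer_table.will_return_once(("SilPass", fake_table))

first = get_xfer_table("SilId_DfClass")   # consumes the queued result
second = get_xfer_table("SilId_DfClass")  # falls back to the default
print(first, second)
```
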
Figure 4. VDB integration for retrieval: Embeddings are generated for (i) knowledge base entries, (ii) existing UTs, …
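
As a rough illustration of embedding-based retrieval in general, the sketch below ranks stored entries by cosine similarity to a query vector. Everything in it is an assumption made for illustration: the entry names, the three-dimensional toy vectors, and the `cosine` and `top_k` helpers are hypothetical, and the embedding model, vector database, and ranking rule used in the paper are not described on this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, index, k=2):
    """Return the names of the k entries most similar to the query."""
    ranked = sorted(
        index.items(), key=lambda item: cosine(query, item[1]), reverse=True
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical index: toy embeddings for knowledge base entries and existing UTs.
INDEX = {
    "kb:one-shot-mock-usage": [0.9, 0.1, 0.0],
    "ut:existing-xfer-test": [0.7, 0.6, 0.1],
    "kb:inf-file-layout": [0.0, 0.2, 0.9],
}

query_embedding = [1.0, 0.2, 0.0]  # toy embedding of the function under test
retrieved = top_k(query_embedding, INDEX, k=2)
print(retrieved)
```
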
Figure 5. 11-stage workflow. The loop is controlled by …
Figure 6. C source with per-line hit/miss annotations derived from LCOV, where 1 = hit and 0 = miss.
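
One plausible way to derive such 1/0 annotations is to read the `DA:<line>,<count>` records of an LCOV tracefile. The sketch below does that and computes a line-coverage percentage. It is an illustration, not the paper's tooling: the helper names `parse_lcov`, `line_coverage`, and `missed_lines`, the file path, and the sample tracefile contents are all made up for the example.

```python
def parse_lcov(tracefile_text):
    """Map each source file to {line_number: 1 if hit else 0}."""
    annotations = {}
    current = None
    for raw in tracefile_text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            current = annotations.setdefault(line[3:], {})
        elif line.startswith("DA:") and current is not None:
            fields = line[3:].split(",")
            line_no, count = int(fields[0]), int(fields[1])
            # A line counts as hit if any record reports a nonzero count.
            hit = 1 if count > 0 else 0
            current[line_no] = max(current.get(line_no, 0), hit)
        elif line == "end_of_record":
            current = None
    return annotations


def line_coverage(hits):
    """Percentage of instrumented lines with at least one hit."""
    return 100.0 * sum(hits.values()) / len(hits) if hits else 0.0


def missed_lines(hits):
    """Sorted line numbers that were never executed."""
    return sorted(n for n, h in hits.items() if h == 0)


# Hypothetical tracefile content for a single source file.
SAMPLE = """TN:
SF:/src/Example.c
DA:10,3
DA:11,3
DA:12,0
DA:14,1
end_of_record
"""

hits = parse_lcov(SAMPLE)["/src/Example.c"]
print(hits, line_coverage(hits), missed_lines(hits))
```

A list of missed lines like the one returned by `missed_lines` is the kind of signal a coverage-guided repair step could consume.
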
Figure 7. Iterations, time, and cost with and without LCA.
Figure 8. Distribution of time, cost, and tokens for configurations with and without LCA.
Figure 9. Time (s), cost, and total tokens by function category for the …
Figure 10. Time (s), cost, and total tokens by function category for the …

Original abstract

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a multi-agent LLM workflow for generating unit tests in the openSIL firmware codebase. It combines test scaffold generation, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. Empirical results on 76 functions report 73 compilable tests; mean line coverage is 73.9% without line-coverage guidance or retrieval augmentation and reaches 98.8% with line-coverage guidance on a 48-function subset.

Significance. If the results hold, the work addresses a practical pain point in low-level firmware validation under strict build constraints and supplies concrete, reproducible metrics (compilation counts, coverage, token/cost usage) on a real AMD-maintained codebase. These strengths make the contribution potentially useful for SE practitioners working with constrained C environments.

major comments (3)
  1. [Evaluation] Evaluation section: the 73/76 compilation success and the 98.8% coverage figures are reported only for the full multi-agent + iterative-repair workflow. No control arm applies an otherwise identical initial prompt (same model, same library stubs, same context) in a single forward pass, so the incremental contribution of the repair loop cannot be isolated.
  2. [Evaluation] Function selection paragraph (near the start of the evaluation): the criteria used to choose the 76 functions and the precise exclusion rules are not fully specified, which limits reproducibility and makes it impossible to judge whether the reported rates generalize beyond the selected set.
  3. [Evaluation] 48-function subset comparison: the paper reports mean coverage of 98.8% (line-coverage guidance) versus 94.7% (with retrieval) but supplies neither per-function variance, statistical tests, nor confidence intervals, weakening the claim that guidance alone produces a reliable improvement.
minor comments (2)
  1. The abstract names secondary measures (time, cost, token usage); the main text should tabulate them explicitly alongside the primary metrics for each configuration.
  2. Clarify the exact LLM model, temperature, and maximum repair iterations in the workflow description so that the experiment is fully replicable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 73/76 compilation success and the 98.8% coverage figures are reported only for the full multi-agent + iterative-repair workflow. No control arm applies an otherwise identical initial prompt (same model, same library stubs, same context) in a single forward pass, so the incremental contribution of the repair loop cannot be isolated.

    Authors: We agree that a single-pass baseline would strengthen the evaluation by isolating the repair loop's contribution. The manuscript presents the results for the complete workflow, which includes the iterative repair as an essential component for achieving compilable tests under firmware constraints. In the revision, we will expand the evaluation discussion to note the typical number of repair iterations required and the common failure modes observed in initial generations, thereby providing indirect evidence for the loop's value. revision: partial

  2. Referee: [Evaluation] Function selection paragraph (near the start of the evaluation): the criteria used to choose the 76 functions and the precise exclusion rules are not fully specified, which limits reproducibility and makes it impossible to judge whether the reported rates generalize beyond the selected set.

    Authors: We will revise the function selection paragraph to fully specify the criteria used to choose the 76 functions, including the precise exclusion rules and the rationale for selection from the openSIL codebase. revision: yes

  3. Referee: [Evaluation] 48-function subset comparison: the paper reports mean coverage of 98.8% (line-coverage guidance) versus 94.7% (with retrieval) but supplies neither per-function variance, statistical tests, nor confidence intervals, weakening the claim that guidance alone produces a reliable improvement.

    Authors: We agree with this observation. The revised manuscript will include per-function coverage data for the 48-function subset, along with variance measures, and we will conduct and report appropriate statistical tests with confidence intervals to substantiate the comparison. revision: yes
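
One common way to put a confidence interval on a paired comparison like this is a percentile bootstrap over per-function differences. The sketch below shows the idea. The coverage values are invented placeholders, not data from the paper, and `paired_bootstrap_ci` is a hypothetical helper; the authors do not say which test they will use.

```python
import random
from statistics import mean


def paired_bootstrap_ci(a, b, resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences a[i] - b[i].

    Returns (observed_mean_difference, lower_bound, upper_bound).
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(resamples)
    )
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return mean(diffs), lower, upper


# Placeholder per-function line coverage (percent), NOT taken from the paper.
guidance_only = [100.0, 97.5, 100.0, 95.0, 100.0, 98.0]
guidance_plus_retrieval = [96.0, 92.0, 100.0, 90.5, 97.0, 93.0]

observed, lower, upper = paired_bootstrap_ci(
    guidance_only, guidance_plus_retrieval
)
print(f"mean difference {observed:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Pairing by function matters here because both configurations were run on the same 48 functions, so per-function differences remove between-function variation from the comparison.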

Circularity Check

0 steps flagged

No circularity; all claims are direct empirical measurements

full rationale

The manuscript presents an empirical workflow evaluation on a fixed set of 76 openSIL functions, reporting observed outcomes such as 73/76 compilation success and line-coverage percentages (73.9%, 98.8%, 94.7%) under different configurations. No equations, fitted parameters, predictions, or derivations appear in the text that reduce by construction to inputs, self-definitions, or self-citations. Results are measured quantities from running the described pipeline, with no load-bearing theoretical steps or renamings that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that current LLMs can reliably translate compiler diagnostics and coverage reports into code edits; no free parameters, new entities, or ad-hoc axioms beyond standard LLM code-generation capabilities are introduced.

axioms (1)
  • domain assumption: LLMs can interpret build logs and coverage reports to generate effective repairs
    The iterative repair loop depends on this capability; invoked in the description of the compile-dispatch repair loop.

pith-pipeline@v0.9.1-grok · 5800 in / 1334 out tokens · 37936 ms · 2026-06-26T17:03:59.408406+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 1 canonical work page

  [1] Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, and Junjie Chen. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2024. arXiv:2406.18181.

  [2] Ani Rahmani, Joe Lian Min, and Asri Maspupah. An evaluation of code coverage adequacy in automatic testing using control flow graph visualization. In Proceedings of the 2020 IEEE 10th Symposium on Computer Applications and Industrial Electronics (ISCAIE), pages 239–244. IEEE, 2020.

  [3] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(5):507–525, 2015.

  [4] Tianocore. EDK II build system, 2025. Accessed: 2025-06-14.

  [5] Michael Feathers. Working Effectively with Legacy Code. Addison-Wesley Professional, 2004.

  [6] Michael Schäfer, Sarah Nadi, Armin Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1):85–105, January 2024.

  [7] Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. ASTER: Natural and multi-language unit test generation with LLMs. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 413–424, 2025.

  [8] Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving ChatGPT for unit test generation. In Proceedings of the ACM on Software Engineering (ESEC/FSE), 2024.

  [9] Linux Test Project. LCOV: Code coverage report generator, 2025. Accessed: 2025.

  [10] Matt Staats, Michael W. Whalen, and Mats P. E. Heimdahl. Programs, tests, and oracles: The foundations of testing revisited. In Proceedings of the 33rd International Conference on Software Engineering (ICSE), pages 391–400. ACM, 2011.

  [11] GNU Binutils. ld: The GNU linker (Options: --wrap), 2025. Accessed: 2025.

  [12] Ceedling. Ceedling: BDD-style unit testing for C, 2025. Accessed: 2025.

  [13] Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 209–224. USENIX, 2008.

  [14] Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649–678, 2011.

  [15] Gordon Fraser and Andrea Arcuri. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 416–419. ACM, 2011.

  [16] Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. Leveraging large language models for enhancing the understandability of generated unit tests. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE), 2025.

  [17] Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C. Desmarais. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468, July 2024.

  [18] Rabimba Karanjai, Aftab Hussain, Md Rafiqul Islam Rabin, Lei Xu, Weidong Shi, and Mohammad Amin Alipour. Harnessing the power of LLMs: Automating unit test generation for high-performance computing, 2024.

  [19] Quanjun Zhang, Chunrong Fang, Siqi Gu, Ye Shang, Zhenyu Chen, and Liang Xiao. Large language models for unit testing: A systematic literature review, 2025.

  [20] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, 50(4):911–936, April 2024.

  [21] Shervin Minaee, Tomás Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2024.

  [22] OpenAI. Introducing OpenAI o3 and o4-mini, 2025. Accessed: 2025-06-26.

  [23] Norman E. Fenton and James Bieman. Software Metrics: A Rigorous and Practical Approach. CRC Press, 3 edition, 2014.

  [24] Yuwei Zhang, Qingyuan Lu, Kai Liu, Wensheng Dou, Jiaxin Zhu, Li Qian, Chunxi Zhang, Zheng Lin, and Jun Wei. CITYWALK: Enhancing LLM-based C++ unit test generation via project-dependency awareness and language-specific knowledge, 2025.

  [25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020.

  [26] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation, 2023.

  [27] Siyuan Zhang, Ying Ding, Shuaijun Lian, Shuai Song, and Hao Li. CodeRAG: Finding relevant and necessary knowledge for retrieval-augmented repository-level code completion. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

  [28] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Sanjit Sen. CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), pages 919–931. IEEE, 2023.

  [29] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. Automatically finding patches using genetic programming. In Proceedings of the 31st International Conference on Software Engineering (ICSE), pages 364–374. IEEE, 2009.

  [30] Martin Monperrus. Automatic software repair: A bibliography, 2018.

  [31] Valentin J. M. Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J. Schwartz, and Maverick Woo. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering, 47(11):2312–2331, 2021.

  [32] Aider. Aider LLM leaderboards, 2025. Accessed: 2025-09-06.