Assessing REST API Test Generation Strategies with Log Coverage
Pith reviewed 2026-05-10 17:43 UTC · model grok-4.3
The pith
On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests for REST APIs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using proposed average, minimum, and maximum log coverage metrics, the empirical study on Light-OAuth2 finds that Claude Opus 4.6 tests uncover, on average, 28.4% more unique log templates than human-written tests, whereas EvoMaster tests find 26.1% fewer and GPT-5.2-Codex tests find 38.6% fewer. Combining human-written tests with Claude tests increases total observed log coverage by 78.4% over the human-written baseline, and similarly large increases occur for other pairings, indicating that the strategies exercise largely distinct runtime behaviors.
What carries the argument
Log coverage metrics based on the number of distinct log templates observed during test execution, used as a proxy for exercised runtime behaviors in black-box REST API testing.
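The abstract names the three metrics but not their computation. One plausible formalization, an assumption on our part rather than the paper's stated definition, works from the set of distinct templates observed in each run:

```latex
% Let T_r be the set of distinct log templates observed in run r, for r = 1, ..., R.
\[
\mathrm{LC}_{\mathrm{avg}} = \frac{1}{R}\sum_{r=1}^{R}\lvert T_r\rvert,\qquad
\mathrm{LC}_{\min} = \min_{1\le r\le R}\lvert T_r\rvert,\qquad
\mathrm{LC}_{\max} = \max_{1\le r\le R}\lvert T_r\rvert.
\]
```

The paper could instead define the maximum over the union of templates across runs; only the methods section can settle which reading is intended.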
If this is right
- Combining human and Claude tests increases log coverage by 78.4% over human tests alone.
- Combining human and EvoMaster tests increases coverage by 30.7% over human tests alone.
- Combining human and GPT-5.2-Codex tests increases coverage by 26.1% over human tests alone.
- The substantial gains from combinations show that the generation strategies exercise largely distinct runtime behaviors (the set arithmetic behind these percentages is sketched below).
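The complementarity percentages reduce to set arithmetic over observed template sets. A minimal sketch of that arithmetic, with hypothetical template names standing in for the paper's unpublished data:

```python
# Sketch of the combined-coverage arithmetic behind the complementarity claims.
# The template sets are hypothetical placeholders, not the paper's data.

def coverage_increase(base: set[str], other: set[str]) -> float:
    """Percentage increase in observed log coverage when `other` is
    added to `base`: |base ∪ other| relative to |base| alone."""
    combined = base | other
    return (len(combined) - len(base)) / len(base) * 100

human_templates = {"login_ok", "token_issued", "bad_client_id"}
claude_templates = {"token_issued", "expired_grant", "scope_rejected", "malformed_jwt"}

# Increase over the human baseline (the paper reports 78.4% for human + Claude).
print(f"{coverage_increase(human_templates, claude_templates):.1f}% over human alone")
# Increase over the Claude baseline (the paper reports 38.9%).
print(f"{coverage_increase(claude_templates, human_templates):.1f}% over Claude alone")
```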
Where Pith is reading between the lines
- Log coverage offers a practical evaluation method for polyglot systems lacking easy code instrumentation.
- Hybrid approaches that combine multiple test generation strategies could yield higher coverage than any single strategy.
- Repeating the study on additional REST API systems would test whether Claude's performance advantage generalizes.
- The observed complementarity could inform the design of adaptive test generation tools that switch between strategies.
Load-bearing premise
That the count of distinct log templates is a valid and sufficient proxy for the diversity of runtime behaviors exercised by the tests.
What would settle it
Finding no correlation between the number of unique log templates and actual code coverage or fault detection rates when instrumenting the system would undermine the metric's validity.
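That validation is mechanically simple once the system is instrumented: pair each test suite's unique-template count with its measured code coverage and test for rank correlation. A sketch under assumed placeholder data, with `scipy` assumed available:

```python
# Sketch of the validation described above: rank-correlate unique log-template
# counts with instrumented line coverage across test suites.
# All values are illustrative placeholders, not measured data.
from scipy.stats import spearmanr

template_counts = [42, 58, 31, 26]        # unique templates per suite (hypothetical)
line_coverage = [0.51, 0.63, 0.40, 0.38]  # instrumented coverage per suite (hypothetical)

rho, p_value = spearmanr(template_counts, line_coverage)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A rho near zero would undermine log coverage as a behavior proxy;
# a strong positive rho would support it.
```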
Original abstract
Assessing the effectiveness of REST API tests in black-box settings can be challenging due to the lack of access to source code coverage metrics and a polyglot tech stack. We propose three metrics for capturing average, minimum, and maximum log coverage to handle the diverse test generation results and runtime behaviors over multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies, evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests, on the Light-OAuth2 authorization microservice system. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with Claude Opus 4.6 tests increases total observed log coverage by 78.4% and 38.9% in human-written and Claude tests, respectively. When combining Locust tests with EvoMaster, the same increases are 30.7% and 76.9%, and when using GPT-5.2-Codex, 26.1% and 105.6%. This means that the generation strategies exercise largely distinct runtime behaviors. Our future work includes extending our study to multiple systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes three log coverage metrics (average, minimum, and maximum) to assess REST API test generation strategies in black-box settings lacking source code access. It empirically evaluates evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust tests on the Light-OAuth2 microservice, reporting that Claude tests uncover 28.4% more unique log templates than human tests on average, while EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Complementarity analysis shows coverage increases (e.g., 78.4% when combining human and Claude tests), indicating distinct runtime behaviors, with future work extending to multiple systems.
Significance. If the log template count is shown to be a reliable proxy for exercised runtime behaviors, the work supplies a practical black-box alternative to code coverage for evaluating test generators in polyglot microservices. The min/avg/max metrics address run-to-run variability, the empirical comparisons quantify LLM potential (particularly Claude) relative to evolutionary and human baselines, and the complementarity results support hybrid testing strategies. These elements advance empirical software engineering knowledge on automated REST API testing effectiveness.
major comments (3)
- The quantitative claims (Claude +28.4%, EvoMaster -26.1%, GPT -38.6%) and complementarity percentages rest on log template counts but provide no information on the number of runs, statistical tests, variance, or the exact procedure for extracting and deduplicating log templates. These details are required to substantiate the reported differences and are load-bearing for the central empirical results.
- The study uses a single system (Light-OAuth2). While the abstract notes future extension to multiple systems, the current single-system design limits generalizability of the relative effectiveness and complementarity claims; this is a load-bearing limitation for the paper's conclusions about strategy differences.
- The assumption that the number of distinct log templates is a sufficient proxy for diversity of runtime behaviors is unvalidated. The paper offers no manual inspection of templates, mapping to API operations/endpoints, or cross-check against other observables (e.g., response status distributions) to confirm that higher counts reflect broader behavior coverage rather than logging artifacts. This directly affects the validity of all reported coverage differences.
minor comments (2)
- The abstract introduces the three log coverage metrics but does not define their precise computation from observed templates across runs; this definition should appear in the methods section for clarity.
- Notation for the proposed metrics (average, minimum, maximum log coverage) should be introduced explicitly and used consistently to avoid ambiguity when discussing results.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point-by-point below, outlining our responses and planned revisions to improve the clarity and validity of our work.
Point-by-point responses
Referee: The quantitative claims (Claude +28.4%, EvoMaster -26.1%, GPT -38.6%) and complementarity percentages rest on log template counts but provide no information on the number of runs, statistical tests, variance, or the exact procedure for extracting and deduplicating log templates. These details are required to substantiate the reported differences and are load-bearing for the central empirical results.
Authors: We agree that additional details are necessary to substantiate the empirical results. The current manuscript describes the overall approach but omits specifics of the experimental protocol. In the revised version, we will add a section detailing the number of runs conducted for each test generation strategy (to account for non-determinism in LLMs and evolutionary algorithms), the variance observed across runs, and the exact log template extraction and deduplication process used (based on parsing log messages to normalize variable parts). We will also report any statistical comparisons performed, or note their absence with justification. This revision will make the quantitative claims more robust and reproducible.
Revision: yes
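The extraction and deduplication step the response refers to is commonly done by masking the variable parts of log messages before counting distinct templates (studies typically use a parser such as Drain). A minimal regex-based sketch, not the authors' actual pipeline:

```python
# Minimal sketch of log-template extraction by masking variable parts.
# The regexes and messages are illustrative; the paper's actual procedure
# is not specified in the abstract.
import re

MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message: str) -> str:
    """Normalize one log message into a template by masking variable fields."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

logs = [
    "client 10.0.0.7 requested token 4821",
    "client 10.0.0.9 requested token 5133",
    "grant expired for user 77",
]
templates = {to_template(m) for m in logs}  # deduplicate into unique templates
print(len(templates), "unique templates:", templates)
```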
Referee: The study uses a single system (Light-OAuth2). While the abstract notes future extension to multiple systems, the current single-system design limits generalizability of the relative effectiveness and complementarity claims; this is a load-bearing limitation for the paper's conclusions about strategy differences.
Authors: We acknowledge this as a valid limitation of the current study. Although the manuscript indicates plans for future work on multiple systems, we will revise the paper to include a more prominent discussion of this limitation in the threats to validity section. This will include considerations of how system-specific factors, such as the logging configuration in Light-OAuth2, might affect the observed differences. We maintain that the results provide valuable insights into the complementarity of strategies even in this single-system context, but agree that broader evaluation is needed for stronger conclusions.
Revision: partial
Referee: The assumption that the number of distinct log templates is a sufficient proxy for diversity of runtime behaviors is unvalidated. The paper offers no manual inspection of templates, mapping to API operations/endpoints, or cross-check against other observables (e.g., response status distributions) to confirm that higher counts reflect broader behavior coverage rather than logging artifacts. This directly affects the validity of all reported coverage differences.
Authors: This is an important point regarding the validity of our proposed metrics. In the absence of source code access, log templates serve as a black-box indicator of exercised code paths through runtime logging. However, we recognize the lack of explicit validation in the current manuscript. We will revise the discussion section to provide a stronger rationale for the proxy, drawing on related literature in log mining for software analysis. Furthermore, we will include an additional analysis involving manual review of representative log templates and their relation to API endpoints, as well as reporting on other runtime observables like response status code distributions to cross-validate the coverage differences.
Revision: yes
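The proposed cross-check against response status distributions is easy to sketch: tally distinct (method, endpoint, status) outcomes per strategy and see whether strategies with more unique templates also produce more distinct outcomes. The endpoints and data below are illustrative assumptions, not measurements from Light-OAuth2:

```python
# Sketch of cross-validating template coverage against a second observable:
# distinct (method, endpoint, status) outcomes per test run. Data is hypothetical.
from collections import Counter

responses = [  # (method, endpoint, status) tuples recorded during a test run
    ("POST", "/oauth2/token", 200),
    ("POST", "/oauth2/token", 400),
    ("POST", "/oauth2/token", 401),
    ("GET", "/oauth2/key", 200),
]

outcome_counts = Counter(responses)
print(f"{len(outcome_counts)} distinct endpoint/status outcomes")
# If a strategy with more unique log templates also exercises more distinct
# outcomes, that supports the templates-as-behavior-proxy assumption.
```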
Circularity Check
No circularity: purely empirical observations from executed tests
Full rationale
The paper defines log coverage metrics (average, minimum, maximum unique log templates) and reports direct empirical measurements obtained by executing the generated tests on Light-OAuth2. The headline percentages (28.4% more, 26.1% fewer, 38.6% fewer) and complementarity increases are simple arithmetic comparisons of observed template counts across runs; they do not reduce to any fitted parameter, self-referential definition, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked. The evaluation is self-contained against the runtime logs produced by the actual test executions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: distinct log templates correspond to meaningfully different runtime behaviors exercised by the tests.