Assessing REST API Test Generation Strategies with Log Coverage
Pith reviewed 2026-05-10 17:43 UTC · model grok-4.3
The pith
On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests for REST APIs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using proposed average, minimum, and maximum log coverage metrics, the empirical study on Light-OAuth2 finds that Claude Opus 4.6 tests uncover, on average, 28.4% more unique log templates than human-written tests, whereas EvoMaster tests find 26.1% fewer and GPT-5.2-Codex tests find 38.6% fewer. Combining human-written tests with Claude tests increases total observed log coverage by 78.4% over the human-written baseline, and similarly large increases occur for other pairings, indicating that the strategies exercise largely distinct runtime behaviors.
What carries the argument
Log coverage metrics based on the number of distinct log templates observed during test execution, used as a proxy for exercised runtime behaviors in black-box REST API testing.
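The abstract names the three metrics but not their computation. One plausible formalization, an assumption on our part rather than the paper's stated definition, works from the set of distinct templates observed in each run:

```latex
% Let T_r be the set of distinct log templates observed in run r, for r = 1, ..., R.
\[
\mathrm{LC}_{\mathrm{avg}} = \frac{1}{R}\sum_{r=1}^{R}\lvert T_r\rvert,\qquad
\mathrm{LC}_{\min} = \min_{1\le r\le R}\lvert T_r\rvert,\qquad
\mathrm{LC}_{\max} = \max_{1\le r\le R}\lvert T_r\rvert.
\]
```

The paper could instead define the maximum over the union of templates across runs; only the methods section can settle which reading is intended.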
If this is right
- Combining human and Claude tests increases log coverage by 78.4% over human tests alone.
- Combining human and EvoMaster tests increases coverage by 30.7% over human tests alone.
- Combining human and GPT-5.2-Codex tests increases coverage by 26.1% over human tests alone.
- The substantial gains from combinations show that the generation strategies exercise largely distinct runtime behaviors (the set arithmetic behind these percentages is sketched below).
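The complementarity percentages reduce to set arithmetic over observed template sets. A minimal sketch of that arithmetic, with hypothetical template names standing in for the paper's unpublished data:

```python
# Sketch of the combined-coverage arithmetic behind the complementarity claims.
# The template sets are hypothetical placeholders, not the paper's data.

def coverage_increase(base: set[str], other: set[str]) -> float:
    """Percentage increase in observed log coverage when `other` is
    added to `base`: |base ∪ other| relative to |base| alone."""
    combined = base | other
    return (len(combined) - len(base)) / len(base) * 100

human_templates = {"login_ok", "token_issued", "bad_client_id"}
claude_templates = {"token_issued", "expired_grant", "scope_rejected", "malformed_jwt"}

# Increase over the human baseline (the paper reports 78.4% for human + Claude).
print(f"{coverage_increase(human_templates, claude_templates):.1f}% over human alone")
# Increase over the Claude baseline (the paper reports 38.9%).
print(f"{coverage_increase(claude_templates, human_templates):.1f}% over Claude alone")
```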
Where Pith is reading between the lines
- Log coverage offers a practical evaluation method for polyglot systems lacking easy code instrumentation.
- Hybrid approaches that combine multiple test generation strategies could yield higher coverage than any single strategy.
- Repeating the study on additional REST API systems would test whether Claude's performance advantage generalizes.
- The observed complementarity could inform the design of adaptive test generation tools that switch between strategies.
Load-bearing premise
That the count of distinct log templates is a valid and sufficient proxy for the diversity of runtime behaviors exercised by the tests.
What would settle it
Finding no correlation between the number of unique log templates and actual code coverage or fault detection rates when instrumenting the system would undermine the metric's validity.
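That validation is mechanically simple once the system is instrumented: pair each test suite's unique-template count with its measured code coverage and test for rank correlation. A sketch under assumed placeholder data, with `scipy` assumed available:

```python
# Sketch of the validation described above: rank-correlate unique log-template
# counts with instrumented line coverage across test suites.
# All values are illustrative placeholders, not measured data.
from scipy.stats import spearmanr

template_counts = [42, 58, 31, 26]        # unique templates per suite (hypothetical)
line_coverage = [0.51, 0.63, 0.40, 0.38]  # instrumented coverage per suite (hypothetical)

rho, p_value = spearmanr(template_counts, line_coverage)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A rho near zero would undermine log coverage as a behavior proxy;
# a strong positive rho would support it.
```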
Original abstract
Assessing the effectiveness of REST API tests in black-box settings can be challenging due to the lack of access to source code coverage metrics and a polyglot tech stack. We propose three metrics for capturing average, minimum, and maximum log coverage to handle the diverse test generation results and runtime behaviors over multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies, evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests, on the Light-OAuth2 authorization microservice system. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with Claude Opus 4.6 tests increases total observed log coverage by 78.4% and 38.9% in human-written and Claude tests, respectively. When combining Locust tests with EvoMaster, the same increases are 30.7% and 76.9%, and when using GPT-5.2-Codex, 26.1% and 105.6%. This means that the generation strategies exercise largely distinct runtime behaviors. Our future work includes extending our study to multiple systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes three log coverage metrics (average, minimum, and maximum) to assess REST API test generation strategies in black-box settings lacking source code access. It empirically evaluates evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust tests on the Light-OAuth2 microservice, reporting that Claude tests uncover 28.4% more unique log templates than human tests on average, while EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Complementarity analysis shows coverage increases (e.g., 78.4% when combining human and Claude tests), indicating distinct runtime behaviors, with future work extending to multiple systems.
Significance. If the log template count is shown to be a reliable proxy for exercised runtime behaviors, the work supplies a practical black-box alternative to code coverage for evaluating test generators in polyglot microservices. The min/avg/max metrics address run-to-run variability, the empirical comparisons quantify LLM potential (particularly Claude) relative to evolutionary and human baselines, and the complementarity results support hybrid testing strategies. These elements advance empirical software engineering knowledge on automated REST API testing effectiveness.
major comments (3)
- The quantitative claims (Claude +28.4%, EvoMaster -26.1%, GPT -38.6%) and complementarity percentages rest on log template counts but provide no information on the number of runs, statistical tests, variance, or the exact procedure for extracting and deduplicating log templates. These details are required to substantiate the reported differences and are load-bearing for the central empirical results.
- The study uses a single system (Light-OAuth2). While the abstract notes future extension to multiple systems, the current single-system design limits generalizability of the relative effectiveness and complementarity claims; this is a load-bearing limitation for the paper's conclusions about strategy differences.
- The assumption that the number of distinct log templates is a sufficient proxy for diversity of runtime behaviors is unvalidated. The paper offers no manual inspection of templates, mapping to API operations/endpoints, or cross-check against other observables (e.g., response status distributions) to confirm that higher counts reflect broader behavior coverage rather than logging artifacts. This directly affects the validity of all reported coverage differences.
minor comments (2)
- The abstract introduces the three log coverage metrics but does not define their precise computation from observed templates across runs; this definition should appear in the methods section for clarity.
- Notation for the proposed metrics (average, minimum, maximum log coverage) should be introduced explicitly and used consistently to avoid ambiguity when discussing results.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point-by-point below, outlining our responses and planned revisions to improve the clarity and validity of our work.
Point-by-point responses
Referee: The quantitative claims (Claude +28.4%, EvoMaster -26.1%, GPT -38.6%) and complementarity percentages rest on log template counts but provide no information on the number of runs, statistical tests, variance, or the exact procedure for extracting and deduplicating log templates. These details are required to substantiate the reported differences and are load-bearing for the central empirical results.
Authors: We agree that additional details are necessary to substantiate the empirical results. The current manuscript describes the overall approach but omits specifics of the experimental protocol. In the revised version, we will add a section detailing the number of runs conducted for each test generation strategy (to account for non-determinism in LLMs and evolutionary algorithms), the variance observed across runs, and the exact log template extraction and deduplication process used (based on parsing log messages to normalize variable parts). We will also report any statistical comparisons performed, or note their absence with justification. This revision will make the quantitative claims more robust and reproducible.
Revision: yes
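The extraction and deduplication step the response refers to is commonly done by masking the variable parts of log messages before counting distinct templates (studies typically use a parser such as Drain). A minimal regex-based sketch, not the authors' actual pipeline:

```python
# Minimal sketch of log-template extraction by masking variable parts.
# The regexes and messages are illustrative; the paper's actual procedure
# is not specified in the abstract.
import re

MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message: str) -> str:
    """Normalize one log message into a template by masking variable fields."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

logs = [
    "client 10.0.0.7 requested token 4821",
    "client 10.0.0.9 requested token 5133",
    "grant expired for user 77",
]
templates = {to_template(m) for m in logs}  # deduplicate into unique templates
print(len(templates), "unique templates:", templates)
```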
Referee: The study uses a single system (Light-OAuth2). While the abstract notes future extension to multiple systems, the current single-system design limits generalizability of the relative effectiveness and complementarity claims; this is a load-bearing limitation for the paper's conclusions about strategy differences.
Authors: We acknowledge this as a valid limitation of the current study. Although the manuscript indicates plans for future work on multiple systems, we will revise the paper to include a more prominent discussion of this limitation in the threats to validity section. This will include considerations of how system-specific factors, such as the logging configuration in Light-OAuth2, might affect the observed differences. We maintain that the results provide valuable insights into the complementarity of strategies even in this single-system context, but agree that broader evaluation is needed for stronger conclusions.
Revision: partial
Referee: The assumption that the number of distinct log templates is a sufficient proxy for diversity of runtime behaviors is unvalidated. The paper offers no manual inspection of templates, mapping to API operations/endpoints, or cross-check against other observables (e.g., response status distributions) to confirm that higher counts reflect broader behavior coverage rather than logging artifacts. This directly affects the validity of all reported coverage differences.
Authors: This is an important point regarding the validity of our proposed metrics. In the absence of source code access, log templates serve as a black-box indicator of exercised code paths through runtime logging. However, we recognize the lack of explicit validation in the current manuscript. We will revise the discussion section to provide a stronger rationale for the proxy, drawing on related literature in log mining for software analysis. Furthermore, we will include an additional analysis involving manual review of representative log templates and their relation to API endpoints, as well as reporting on other runtime observables like response status code distributions to cross-validate the coverage differences.
Revision: yes
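The proposed cross-check against response status distributions is easy to sketch: tally distinct (method, endpoint, status) outcomes per strategy and see whether strategies with more unique templates also produce more distinct outcomes. The endpoints and data below are illustrative assumptions, not measurements from Light-OAuth2:

```python
# Sketch of cross-validating template coverage against a second observable:
# distinct (method, endpoint, status) outcomes per test run. Data is hypothetical.
from collections import Counter

responses = [  # (method, endpoint, status) tuples recorded during a test run
    ("POST", "/oauth2/token", 200),
    ("POST", "/oauth2/token", 400),
    ("POST", "/oauth2/token", 401),
    ("GET", "/oauth2/key", 200),
]

outcome_counts = Counter(responses)
print(f"{len(outcome_counts)} distinct endpoint/status outcomes")
# If a strategy with more unique log templates also exercises more distinct
# outcomes, that supports the templates-as-behavior-proxy assumption.
```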
Circularity Check
No circularity: purely empirical observations from executed tests
Full rationale
The paper defines log coverage metrics (average, minimum, maximum unique log templates) and reports direct empirical measurements obtained by executing the generated tests on Light-OAuth2. The headline percentages (28.4% more, 26.1% fewer, 38.6% fewer) and complementarity increases are simple arithmetic comparisons of observed template counts across runs; they do not reduce to any fitted parameter, self-referential definition, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked. The evaluation is self-contained against the runtime logs produced by the actual test executions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: distinct log templates correspond to meaningfully different runtime behaviors exercised by the tests.