Augmenting unit test suites from integration tests

Katerina Paltoglou; Vassilis E. Zafeiris

arxiv: 2604.17508 · v2 · submitted 2026-04-19 · 💻 cs.SE

Pith reviewed 2026-05-10 05:24 UTC · model grok-4.3

classification 💻 cs.SE

keywords unit test generation · integration tests · static analysis · dynamic analysis · test suite augmentation · software testing · Node.js

The pith

Static and dynamic analysis turns integration tests into isolated unit tests that check component dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a technique to automatically generate unit tests from existing integration tests in projects where coarse-grained tests dominate the suite. It applies static analysis to identify component boundaries and dynamic analysis to capture execution traces from the integration tests, then produces new unit tests that exercise the same dependencies but without the surrounding context. A sympathetic reader would care because this shift promises better fault localization, quicker test runs, and test suites that more closely match the ideal pyramid with many fine-grained checks at the base.

Core claim

The method employs static and dynamic analysis to augment a test suite by extracting unit tests from integration tests. Integration tests exercise a component together with its dependencies; the analysis isolates the component's interactions so that the generated unit tests can verify those same dependencies independently.

What carries the argument

The static-and-dynamic analysis pipeline that extracts isolated component behaviors and dependency interactions from integration test executions.

If this is right

Generated unit tests provide finer fault localization than the original integration tests.
Overall test execution time decreases while code coverage can increase.
The approach works on twelve open-source Node.js projects and can be ported to other languages.
Test suites can move closer to the recommended pyramid structure with more unit tests at the base.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers maintaining legacy codebases with imbalanced test suites could apply the same extraction process.
The generated unit tests might be combined with coverage tools to guide further manual test additions.
Extending the method to track data-flow dependencies could strengthen isolation guarantees.

Load-bearing premise

Static and dynamic analysis can reliably separate a component's own logic from its interactions with dependencies without losing essential semantics or creating false test cases.

What would settle it

A generated unit test that fails to detect a dependency fault caught by its source integration test, or that passes when the integration test fails on the same interaction.

Original abstract

We propose a method that employs static and dynamic analysis for augmenting a test suite with automatically generated unit tests. The method is most suitable for test suites where the stratification of unit, integration and system tests does not conform to the recommended test pyramid structure: numerous unit tests providing high code coverage and forming the base, fewer integration tests in the middle that verify component collaboration, and far fewer system or UI tests at the top that exercise acceptance or other scenarios of use. Instead, integration and system tests represent the majority of test cases, resulting in coarse-grained tests with limited fault localization and longer execution times. The method leverages integration tests, exercising a component and its dependencies, to generate unit tests that verify component dependencies in isolation. We showcase and empirically evaluate the proposed method in the Node.js platform, although it can be ported and adapted to other languages and platforms. The evaluation is based on a research prototype implemented as a Node.js tool and is conducted in the context of twelve open source JS applications (benchmark projects). Evaluation results support the effectiveness and practicality of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a concrete prototype for turning integration tests into unit tests via static and dynamic analysis in Node.js, but the evaluation lacks the numbers and controls needed to assess reliability.

The core contribution is a method that runs integration tests to capture traces, then uses static analysis to identify component boundaries and dynamic analysis to extract isolated unit tests with mocked dependencies. This targets projects where integration tests dominate and unit tests are scarce, which is a frequent real-world pattern. The authors built a working Node.js tool and applied it to twelve open-source applications, which at least demonstrates feasibility on actual code rather than toy examples.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a method that applies static and dynamic analysis to integration tests in order to automatically generate unit tests that exercise individual components in isolation. The technique targets Node.js applications whose test suites deviate from the ideal test pyramid (i.e., integration tests dominate), with the goal of improving fault localization and reducing test execution time. A research prototype is implemented and evaluated on twelve open-source JavaScript projects; the authors conclude that the results demonstrate the effectiveness and practicality of the approach.

Significance. If the quantitative results and validity arguments hold, the work addresses a practical pain point in software testing for dynamic languages where integration tests are over-represented. The combination of static and dynamic analysis to decompose integration-test traces is a reasonable technical direction, and the use of twelve real-world projects supplies a useful empirical anchor. The paper would merit explicit credit for reproducible artifacts and a threats-to-validity discussion once those elements are added or strengthened.

major comments (2)

[Abstract] Abstract: the statement that 'evaluation results support the effectiveness and practicality of our approach' is not accompanied by any reported metrics, baseline comparisons, statistical tests, or threats-to-validity discussion. Because the central claim of the paper is empirical, this omission renders the effectiveness assertion impossible to assess from the provided text.
[Evaluation] Evaluation section: the isolation of component behavior from integration-test traces is load-bearing for the method, yet the manuscript does not report how the dynamic analysis handles Node.js-specific features (event loops, callbacks, async/await, prototype mutation) or quantify false-positive mocks and lost interaction semantics. Without such evidence the claim that generated unit tests preserve observable behavior cannot be verified.

minor comments (2)

[Implementation] The description of the prototype implementation would be clearer if accompanied by a high-level data-flow diagram showing the static-analysis, dynamic-tracing, and test-generation stages.
[Evaluation] A table summarizing per-project results (number of integration tests processed, unit tests generated, coverage delta, execution-time reduction) should be added to the evaluation to make the 'support' claim concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the empirical claims and technical details.

Referee: [Abstract] Abstract: the statement that 'evaluation results support the effectiveness and practicality of our approach' is not accompanied by any reported metrics, baseline comparisons, statistical tests, or threats-to-validity discussion. Because the central claim of the paper is empirical, this omission renders the effectiveness assertion impossible to assess from the provided text.

Authors: We agree that the abstract claim requires supporting quantitative details. In the revision we will replace the generic statement with a concise summary of key results from the evaluation on the twelve projects (e.g., average unit tests generated per integration test, observed reductions in execution time, and fault-localization improvements). We will also add a brief reference to the expanded threats-to-validity section.
revision: yes

Referee: [Evaluation] Evaluation section: the isolation of component behavior from integration-test traces is load-bearing for the method, yet the manuscript does not report how the dynamic analysis handles Node.js-specific features (event loops, callbacks, async/await, prototype mutation) or quantify false-positive mocks and lost interaction semantics. Without such evidence the claim that generated unit tests preserve observable behavior cannot be verified.

Authors: We acknowledge that the current manuscript provides insufficient detail on Node.js-specific handling and lacks quantitative evidence for behavior preservation. In the revised version we will insert a dedicated subsection describing the dynamic-analysis mechanisms for event loops, callbacks, async/await (via preserved asynchronous wrappers), and prototype mutations. We will also add an empirical quantification, based on manual review of a representative sample of generated tests, reporting false-positive mock rates and any observed loss of interaction semantics, thereby supporting the preservation claim.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method evaluated on external projects

The paper proposes and empirically evaluates a static/dynamic analysis technique to derive unit tests from integration tests on 12 open-source Node.js applications. No equations, parameters fitted to the target result, self-citations as load-bearing premises, or uniqueness theorems appear in the derivation. The central claim rests on prototype implementation and external benchmark results rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method depends on standard program-analysis assumptions rather than new invented entities or fitted constants.

axioms (1)

domain assumption: Integration tests exercise components together with their dependencies in a way that static and dynamic analysis can separate into isolated unit-level behaviors.
This premise is required for the extraction step to produce valid unit tests.
