Empirical Derivations from an Evolving Test Suite
Pith reviewed 2026-05-18 01:09 UTC · model grok-4.3
The pith
The automated test suite of an operating system has expanded to over ten thousand cases while failure rates remain largely stable over more than a decade.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The empirical observations indicate that the test suite has grown continuously to cover over ten thousand individual test cases. Failed test cases show overall stability, though shorter periods with more frequent failures have occurred. The same applies to build failures, failures of the test suite to complete, and installation failures. Code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures, with effects that are small on average despite some periods showing larger impacts.
What carries the argument
The longitudinal tracking and statistical examination of failure data, build logs, and installation records collected by the virtualization-based automated testing framework.
If this is right
- Test suites can expand to thousands of cases while keeping overall failure rates stable.
- Short-term spikes in failures do not necessarily reflect long-term trends in software reliability.
- Code churn and kernel modifications produce only small average effects on failure rates across extended time periods.
- Exploratory analysis of evolving test data can inform maintenance strategies for large-scale testing frameworks.
- Periods of higher failures remain localized rather than indicating systemic breakdown in the testing process.
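The difference between overall stability and short-lived spikes in the points above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the monthly counts are invented, and the helper names `failure_rates` and `flag_spike_months` and the threshold of three times the median rate are hypothetical choices.

```python
from statistics import median

def failure_rates(failed, total):
    """Return per-period failure rates (failed / total) for paired sequences."""
    return [f / t for f, t in zip(failed, total)]

def flag_spike_months(rates, factor=3.0):
    """Return indices of periods whose rate exceeds factor * median rate.

    The median serves as a robust baseline, so a single bad period
    does not shift the reference level. The factor is an assumption.
    """
    baseline = median(rates)
    return [i for i, r in enumerate(rates) if r > factor * baseline]

# Hypothetical monthly data: the suite grows, failures scale with it,
# and one month (index 3) shows a short-lived burst of failures.
total = [4000, 5000, 6000, 7000, 8000, 9000, 10000]
failed = [8, 10, 12, 70, 16, 18, 20]

rates = failure_rates(failed, total)
spikes = flag_spike_months(rates)
print(spikes)
```

Normalizing by the number of test cases matters here: raw failure counts rise as the suite grows, while the rate stays flat outside the flagged month.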
Where Pith is reading between the lines
- The observed stability may indicate that test case maintenance and framework updates are effectively managing growth in test volume.
- Similar longitudinal patterns could emerge if the same approach were applied to test suites of other complex software systems.
- Additional variables such as changes in hardware test environments or test case interdependencies might explain the remaining variation in failure rates.
- These results suggest that efforts to reduce failures should target specific high-failure intervals rather than assuming uniform impact from all code changes.
Load-bearing premise
The captured failure data, build logs, and installation records from the testing framework accurately and consistently measure the underlying phenomena across the full longitudinal period without major changes in test execution environment or reporting practices.
What would settle it
A clear and sustained increase in failure rates that aligns precisely with a documented change in test execution environment or data reporting method, even in the absence of corresponding code or kernel modifications.
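One way to probe such a shift is to compare failure counts before and after a documented environment change. The sketch below uses a one-sided permutation test on the difference in means; it is a simplification that ignores autocorrelation in the series, and the function name `permutation_p_value`, the sample counts, and the number of permutations are all assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(before, after, n_perm=2000, seed=0):
    """One-sided permutation test for an increase in mean failures.

    Shuffles the pooled observations and counts how often a random
    split yields a mean difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    observed = mean(after) - mean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[n_before:]) - mean(pooled[:n_before])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical weekly failure counts around an environment change.
before_change = [2, 3, 2, 4, 3, 2, 3, 2]
after_change = [9, 10, 8, 11, 9, 10, 12, 9]
p = permutation_p_value(before_change, after_change)
print(p)
```

A small p-value here would only show that the two windows differ; attributing the difference to the environment change still requires ruling out code and kernel modifications in the same interval.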
Original abstract
The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The longitudinal period observed spans from the initial roll out of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously, currently covering over ten thousand individual test cases. Failed test cases exhibit overall stability, although there have been shorter periods marked with more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by the NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, including particularly with respect to the kernel modifications, the effects are small on average. Even though only in an exploratory manner, these empirical observations contribute to efforts to draw conclusions from large-scale and evolving software test suites.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a longitudinal empirical analysis of the NetBSD operating system's automated, continuous, and virtualization-based test suite from the early 2010s to late 2025. It reports that the suite has grown continuously to over ten thousand test cases, that failed test cases show overall stability with episodic spikes of higher failures, that similar patterns hold for build failures, test-suite completion failures, and installation failures, and that code churn and kernel modifications provide no longitudinally consistent statistical explanations for failures, with only small average effects.
Significance. If the captured failure data remain comparable across the full observation window, the work supplies useful descriptive observations on the long-term behavior of a large, real-world test suite in an open-source operating system. Such empirical patterns can inform broader research on test-suite evolution, maintenance costs, and the limited explanatory power of code-change metrics in complex systems.
major comments (1)
- [Data sources and analysis (implicit in results description)] The manuscript supplies no description of data collection procedures, aggregation methods, statistical models, or controls for changes in the testing framework, virtualization environment, or reporting practices across the ~15-year span. This is load-bearing for the central observational claims of failure stability and small average effects from churn/kernel modifications, because any systematic shift in measurement could produce the reported patterns artifactually.
minor comments (1)
- [Abstract] The abstract states the exploratory framing but does not indicate the time granularity of the longitudinal data or the precise definition of 'code churn' used in the correlation analysis.
Simulated Author's Rebuttal
We thank the referee for their careful review and for recommending major revision. The primary concern about insufficient description of data sources, collection procedures, aggregation, statistical models, and controls for framework changes over the observation period is a valid one that we address directly below. We will revise the manuscript to improve transparency and reproducibility while preserving the core empirical contributions.
Point-by-point responses
- Referee: The manuscript supplies no description of data collection procedures, aggregation methods, statistical models, or controls for changes in the testing framework, virtualization environment, or reporting practices across the ~15-year span. This is load-bearing for the central observational claims of failure stability and small average effects from churn/kernel modifications, because any systematic shift in measurement could produce the reported patterns artifactually.
  Authors: We agree that the manuscript would be strengthened by an explicit account of these elements, as longitudinal comparability is central to interpreting stability and effect sizes. The underlying data derive from NetBSD's public, continuously operating testing infrastructure. In the revised manuscript we will insert a dedicated Data and Methods section that details: (1) the precise sources and extraction process from test-suite logs and related reports; (2) aggregation rules (e.g., daily summaries versus per-commit); (3) the regression and time-series models used to evaluate code-churn and kernel-modification effects, including variable definitions and controls; and (4) documented changes in virtualization, reporting formats, or test harnesses across the window, together with sensitivity analyses that test whether such changes could artifactually produce the observed patterns. These additions will directly address the referee's concern about potential measurement shifts.
  Revision: yes
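The aggregation choice mentioned in item (2) can be made concrete with a small sketch. Everything here is hypothetical: the record layout `(date, churn_lines, failed_tests)`, the helper names, and the rule of summing churn while taking the daily maximum of failed tests are assumptions, and a plain Pearson correlation stands in for the regression and time-series models the authors promise to document.

```python
from math import sqrt

def daily_summaries(records):
    """Aggregate per-commit (date, churn_lines, failed_tests) records by date.

    Churn is summed per day; failed tests are summarized by the daily
    maximum. Both rules are illustrative assumptions.
    """
    totals = {}
    for date, churn, failed in records:
        c, f = totals.get(date, (0, 0))
        totals[date] = (c + churn, max(f, failed))
    return dict(sorted(totals.items()))

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

# Hypothetical per-commit records.
records = [
    ("2025-01-01", 120, 3),
    ("2025-01-01", 30, 5),
    ("2025-01-02", 400, 4),
    ("2025-01-03", 10, 4),
]
daily = daily_summaries(records)
churn = [c for c, _ in daily.values()]
failures = [f for _, f in daily.values()]
r = pearson(churn, failures)
print(daily, r)
```

Running the same correlation on per-commit and on daily data is one simple sensitivity check: if the sign or size of the association changes with the aggregation level, the reported effect depends on that choice.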
Circularity Check
No significant circularity
The paper is a purely observational longitudinal empirical analysis of NetBSD test suite growth and failure patterns, reporting descriptive statistics such as continuous growth to over ten thousand test cases, overall failure stability with episodic spikes, and small average effects from code churn or kernel modifications. No equations, fitted parameters, predictions, derivations, or self-citations appear in the load-bearing claims. All results rest on direct measurement from the testing framework's data logs without reduction to prior inputs or ansatzes, rendering the analysis self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "The test suite has grown continuously, currently covering over ten thousand individual test cases. Failed test cases exhibit overall stability... code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.