arxiv: 2511.00915 · v2 · submitted 2025-11-02 · 💻 cs.SE

Empirical Derivations from an Evolving Test Suite

Pith reviewed 2026-05-18 01:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords software testing · test suite evolution · empirical analysis · failure rates · longitudinal study · continuous integration · build failures

The pith

The automated test suite of an operating system has expanded to over ten thousand cases while failure rates remain largely stable over more than a decade.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes data from a long-running automated test suite spanning the early 2010s to late 2025. It establishes that the suite has grown continuously to cover more than ten thousand test cases, with failed tests showing overall stability interrupted by shorter periods of higher failure rates. The same pattern holds for build failures, incomplete test runs, and installation failures. Code churn and kernel modifications do not supply consistent statistical explanations for these outcomes, as their effects stay small on average. A sympathetic reader would care because these patterns illustrate how large, evolving test suites can be maintained without failures scaling directly with development activity.

Core claim

The empirical observations indicate that the test suite has grown continuously to cover over ten thousand individual test cases. Failed test cases show overall stability, though shorter periods with more frequent failures have occurred. The same applies to build failures, failures of the test suite to complete, and installation failures. Code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures, with effects that are small on average despite some periods showing larger impacts.
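
One way to make "overall stability with shorter periods of more frequent failures" concrete is to flag months whose failure counts sit far above the typical level of the series. The sketch below is an illustration under stated assumptions, not the paper's method: the function name `flag_spikes`, the median-plus-`k`-MAD rule, the default `k = 3`, and the sample numbers are all hypothetical.

```python
from statistics import median


def flag_spikes(monthly_failures, k=3.0):
    """Return indices of months whose failure count exceeds median + k * MAD.

    MAD is the median absolute deviation from the median. It is floored at 1.0
    so that a nearly flat series does not flag every small fluctuation.
    """
    med = median(monthly_failures)
    mad = median(abs(x - med) for x in monthly_failures)
    threshold = med + k * max(mad, 1.0)
    return [i for i, x in enumerate(monthly_failures) if x > threshold]


# Hypothetical monthly averages of failed test cases (NOT the paper's data).
failures = [4, 5, 3, 4, 6, 5, 21, 19, 4, 5, 3, 4]
spikes = flag_spikes(failures)
print(spikes)  # indices of the months treated as a short high-failure period
```

A robust statistic such as the median is used here so that the spikes themselves do not inflate the baseline they are compared against; a mean-based rule would be a reasonable alternative for smoother series.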

What carries the argument

The longitudinal tracking and statistical examination of failure data, build logs, and installation records collected by the virtualization-based automated testing framework.

If this is right

  • Test suites can expand to thousands of cases while keeping overall failure rates stable.
  • Short-term spikes in failures do not necessarily reflect long-term trends in software reliability.
  • Code churn and kernel modifications produce only small average effects on failure rates across extended time periods.
  • Exploratory analysis of evolving test data can inform maintenance strategies for large-scale testing frameworks.
  • Periods of higher failures remain localized rather than indicating systemic breakdown in the testing process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed stability may indicate that test case maintenance and framework updates are effectively managing growth in test volume.
  • Similar longitudinal patterns could emerge if the same approach were applied to test suites of other complex software systems.
  • Additional variables such as changes in hardware test environments or test case interdependencies might explain the remaining variation in failure rates.
  • These results suggest that efforts to reduce failures should target specific high-failure intervals rather than assuming uniform impact from all code changes.

Load-bearing premise

The captured failure data, build logs, and installation records from the testing framework accurately and consistently measure the underlying phenomena across the full longitudinal period without major changes in test execution environment or reporting practices.

What would settle it

A clear and sustained increase in failure rates that aligns precisely with a documented change in test execution environment or data reporting method, even in the absence of corresponding code or kernel modifications.
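
As a rough illustration of how such a check might look, the sketch below compares mean failure rates before and after a documented change date with a one-sided permutation test. Everything here is an assumption made for illustration: the function name `level_shift_pvalue`, the `change_index` parameter, and the sample rates are hypothetical, and the paper does not describe this procedure. The test also treats observations as exchangeable, which autocorrelated monthly series can violate, so the resulting p-value should be read as indicative only.

```python
import random
from statistics import mean


def level_shift_pvalue(rates, change_index, n_perm=2000, seed=0):
    """Return (observed difference, one-sided permutation p-value).

    The difference is mean(rates after change_index) minus mean(rates before).
    The p-value estimates how often a random reshuffling of the same values
    yields a difference at least as large as the observed one.
    """
    observed = mean(rates[change_index:]) - mean(rates[:change_index])
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pooled = list(rates)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[change_index:]) - mean(pooled[:change_index])
        if diff >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical monthly failure rates in percent (NOT the paper's data);
# index 12 marks an assumed documented change in the test environment.
before = [0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5]
after = [1.4, 1.5, 1.6, 1.5, 1.4, 1.5, 1.6, 1.5, 1.4, 1.5, 1.6, 1.5]
shift, p_value = level_shift_pvalue(before + after, change_index=12)
print(shift, p_value)
```

A small p-value at a date with no matching code or kernel change would point toward a measurement shift rather than a change in the software itself.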

Figures

Figures reproduced from arXiv: 2511.00915 by Abhishek Tiwari, Jukka Ruohonen.

Figure 1: Monthly Sample Sizes. Accompanying text: Regarding RQ.3, the regression analyses soon described in Subsection III-D are computed primarily with unaggregated monthly data. The reason is partially practical and related to the data collection: because NetBSD provides the per-architecture online archival files on rolling monthly basis, it is not straightforward to merge and robustly synchronize the files into annual aggregates or…
Figure 2: Test Cases and Failed Test Cases (monthly averages)
Figure 3: Build, Test Suite, and Installation Failures (monthly sums)
Figure 4: Regression Results for the amd64 Architecture (separate monthly estimates; coefficients or marginal effects of the coefficients)
Figure 5: Regression Results for the i386 Architecture (separate monthly estimates; coefficients or marginal effects of the coefficients). Accompanying text: When compared to userland modifications, one or more preceding commits having touched files in the sys directory tends to increase the amount of failed amd64 test cases by nearly four tests, all other things being constant. The effect is visible also for i386, alt…
Figure 6: An Example of Unaggregated Time Series. Accompanying text: Finally, to return to the conclusion about modest or even small effects on average, a few remarks can be made about the pooled regression estimates. Before continuing, it should be remarked that the pooled unaggregated data yields irregular time series due to the unequal sample sizes in the
Original abstract

The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The longitudinal period observed spans from the initial roll out of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously, currently covering over ten thousand individual test cases. Failed test cases exhibit overall stability, although there have been shorter periods marked with more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by the NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, including particularly with respect to the kernel modifications, the effects are small on average. Even though only in an exploratory manner, these empirical observations contribute to efforts to draw conclusions from large-scale and evolving software test suites.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a longitudinal empirical analysis of the NetBSD operating system's automated, continuous, and virtualization-based test suite from the early 2010s to late 2025. It reports that the suite has grown continuously to over ten thousand test cases, that failed test cases show overall stability with episodic spikes of higher failures, that similar patterns hold for build failures, test-suite completion failures, and installation failures, and that code churn and kernel modifications provide no longitudinally consistent statistical explanations for failures, with only small average effects.

Significance. If the captured failure data remain comparable across the full observation window, the work supplies useful descriptive observations on the long-term behavior of a large, real-world test suite in an open-source operating system. Such empirical patterns can inform broader research on test-suite evolution, maintenance costs, and the limited explanatory power of code-change metrics in complex systems.

major comments (1)
  1. [Data sources and analysis (implicit in results description)] The manuscript supplies no description of data collection procedures, aggregation methods, statistical models, or controls for changes in the testing framework, virtualization environment, or reporting practices across the ~15-year span. This is load-bearing for the central observational claims of failure stability and small average effects from churn/kernel modifications, because any systematic shift in measurement could produce the reported patterns artifactually.
minor comments (1)
  1. [Abstract] The abstract states the exploratory framing but does not indicate the time granularity of the longitudinal data or the precise definition of 'code churn' used in the correlation analysis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for recommending major revision. The primary concern about insufficient description of data sources, collection procedures, aggregation, statistical models, and controls for framework changes over the observation period is a valid one that we address directly below. We will revise the manuscript to improve transparency and reproducibility while preserving the core empirical contributions.

Point-by-point responses
  1. Referee: The manuscript supplies no description of data collection procedures, aggregation methods, statistical models, or controls for changes in the testing framework, virtualization environment, or reporting practices across the ~15-year span. This is load-bearing for the central observational claims of failure stability and small average effects from churn/kernel modifications, because any systematic shift in measurement could produce the reported patterns artifactually.

    Authors: We agree that the manuscript would be strengthened by an explicit account of these elements, as longitudinal comparability is central to interpreting stability and effect sizes. The underlying data derive from NetBSD's public, continuously operating testing infrastructure. In the revised manuscript we will insert a dedicated Data and Methods section that details: (1) the precise sources and extraction process from test-suite logs and related reports; (2) aggregation rules (e.g., daily summaries versus per-commit); (3) the regression and time-series models used to evaluate code-churn and kernel-modification effects, including variable definitions and controls; and (4) documented changes in virtualization, reporting formats, or test harnesses across the window, together with sensitivity analyses that test whether such changes could artifactually produce the observed patterns. These additions will directly address the referee's concern about potential measurement shifts. revision: yes
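
The aggregation rules and regression models mentioned in this response can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the authors' analysis: the record layout `(month, touched_kernel, failed_tests)`, the function names, and the sample values are hypothetical, and plain least squares with a single binary regressor stands in for whatever models the revised manuscript would specify. With a 0/1 regressor, the least-squares slope equals the difference in mean failures between the two groups of test runs.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Least-squares slope of y on x; None when x has no variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None  # every run in the month has the same flag value
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def monthly_kernel_effects(records):
    """Group per-run records by month and estimate a separate slope per month."""
    by_month = defaultdict(list)
    for month, touched_kernel, failed_tests in records:
        by_month[month].append((1 if touched_kernel else 0, failed_tests))
    return {
        month: ols_slope([flag for flag, _ in rows], [fails for _, fails in rows])
        for month, rows in sorted(by_month.items())
    }


# Hypothetical per-run records (NOT the paper's data):
# (month, preceding commits touched kernel sources?, number of failed tests)
records = [
    ("2025-01", False, 3), ("2025-01", True, 6),
    ("2025-01", False, 5), ("2025-01", True, 8),
    ("2025-02", False, 4), ("2025-02", False, 4), ("2025-02", True, 5),
    ("2025-03", False, 2), ("2025-03", False, 3),
]
effects = monthly_kernel_effects(records)
print(effects)
```

Estimating one slope per month, rather than one pooled slope, makes it visible when an effect is large in a few periods but small on average; months without variation in the flag yield no estimate at all.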

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely observational longitudinal empirical analysis of NetBSD test suite growth and failure patterns, reporting descriptive statistics such as continuous growth to over ten thousand test cases, overall failure stability with episodic spikes, and small average effects from code churn or kernel modifications. No equations, fitted parameters, predictions, derivations, or self-citations appear in the load-bearing claims. All results rest on direct measurement from the testing framework's data logs without reduction to prior inputs or ansatzes, rendering the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work rests entirely on observed longitudinal data from the NetBSD testing framework.


