StationarityToolkit: Comprehensive Time Series Stationarity Analysis in Python

Bhanu Suraj Malla; Yuqing Hu

arxiv: 2604.08676 · v1 · submitted 2026-04-09 · 📊 stat.ME

Pith reviewed 2026-05-10 16:46 UTC · model grok-4.3

classification 📊 stat.ME

keywords time series · stationarity · statistical tests · Python library · trend · variance · seasonality · diagnostics

The pith

A Python library runs ten statistical tests across trend, variance, and seasonality to diagnose non-stationarity in time series data with detailed reports instead of binary verdicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

StationarityToolkit is a Python library that performs ten statistical tests divided into trend, variance, and seasonality categories to analyze time series data. This approach matters because different forms of non-stationarity require distinct tests and transformations, and a single test often fails to distinguish them. The library automatically infers frequency from a datetime index, supplies test statistics, p-values, and actionable notes for each result, and supports an iterative test-transform-retest process. Users can therefore identify the specific type of non-stationarity present rather than receiving only a yes-or-no answer. By grouping tests into three categories the toolkit aims to cover the main sources of non-stationarity encountered in practical forecasting and analysis tasks.

Core claim

The paper presents StationarityToolkit as a comprehensive Python library that executes 10 statistical tests—four for trends, four for variance changes, and two for seasonality—on time series that include a datetime index. It infers the series frequency automatically, reports test statistics and p-values with clear interpretations, and adds actionable notes on what each detection implies, enabling users to apply targeted transformations and retest until the series satisfies stationarity assumptions for downstream modeling.

What carries the argument

The StationarityToolkit library, which categorizes and orchestrates ten stationarity tests to generate diagnostic outputs with statistics, p-values, and transformation recommendations.
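To make the idea of per-test diagnostics concrete, here is a minimal sketch of how results could be grouped by category rather than collapsed into one verdict. It is not the library's API: `DiagnosticResult`, `summarize`, and every field name are hypothetical, and the numbers are placeholders rather than output from a real run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticResult:
    """One test outcome (hypothetical structure, not the toolkit's own)."""
    name: str         # test label, e.g. "ADF"
    category: str     # "trend", "variance", or "seasonality"
    statistic: float
    p_value: float
    detected: bool    # True if this test indicates non-stationarity
    note: str         # actionable hint for the user


def summarize(results):
    """Group results by category and list which tests flagged a problem."""
    by_category = defaultdict(list)
    for result in results:
        by_category[result.category].append(result)
    return {
        category: {
            "tests_run": len(items),
            "flagged": [r.name for r in items if r.detected],
        }
        for category, items in by_category.items()
    }


# Placeholder values for illustration only.
results = [
    DiagnosticResult("ADF", "trend", -1.2, 0.67, True,
                     "unit root not rejected; consider differencing"),
    DiagnosticResult("KPSS", "trend", 0.9, 0.01, True,
                     "stationarity rejected; consider detrending"),
    DiagnosticResult("Levene", "variance", 0.4, 0.53, False,
                     "no variance shift detected"),
]

summary = summarize(results)
for category, info in summary.items():
    print(category, info)
```

The `detected` flag is stored separately from the p-value because the tests do not share a null hypothesis: ADF takes a unit root as its null, while KPSS takes stationarity as its null, so a small p-value means opposite things for the two.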

Load-bearing premise

That the ten selected tests together capture the main types of non-stationarity found in real data, and that automatic frequency inference from datetime indices works reliably across different formats and sampling patterns.

What would settle it

A time series known to contain a structural break or variance shift that none of the ten tests flags, yet subsequent forecasting models trained on the data show clear degradation attributable to undetected non-stationarity.
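A search for such a counterexample could start from synthetic series whose defect is known by construction. The sketch below builds a series with a constant mean and a variance shift at the midpoint, then checks it with Levene's test from SciPy. Levene's test appears in the paper's reference list, but how the toolkit applies it is not described here, so the split-half comparison, the seed, and the 0.05 threshold are all assumptions.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(seed=0)
n = 400

# Constant mean, but the standard deviation jumps from 1 to 5 at the midpoint.
series = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=n // 2),
    rng.normal(loc=0.0, scale=5.0, size=n // 2),
])

# Compare the spread of the two halves (assumed split point: the midpoint).
first_half, second_half = series[: n // 2], series[n // 2:]
stat, p_value = levene(first_half, second_half)

variance_shift_flagged = p_value < 0.05
print(f"Levene W = {stat:.2f}, p = {p_value:.3g}, flagged = {variance_shift_flagged}")
```

A shift this large is easy to detect. The informative cases for the falsification test are subtler ones, for example a small shift or a break point far from where the series is split, where a test may stay silent.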

Figures

Figures reproduced from arXiv: 2604.08676 by Bhanu Suraj Malla, Yuqing Hu.

**Figure 2.** Example output: StationarityToolkit summary and detailed test results.

Original abstract

Time-series stationarity is a property that statistical characteristics such as trend, variance, seasonality remain constant over time. It is considered fundamental to many forecasting and analysis methods. Different tests detect different types of non-stationarity: structural breaks or deterministic trends, clustered or time-dependent variance, stochastic or deterministic seasonality. A series might pass one test while failing another; single-test approaches seldom distinguish between conceptually different types of non-stationarity that require different types of tests and transformations. `StationarityToolkit` addresses this by providing a comprehensive Python library that runs 10 statistical tests across three categories: trend (4 tests), variance (4 tests), and seasonality (2 tests). Rather than a binary stationary/non-stationary verdict, users receive detailed diagnostics with actionable notes for each detection. The toolkit automatically infers the frequency of the data provided (requires datetime index), provides clear interpretations with test statistics and p-values, and supports an iterative test-transform-retest workflow essential for real-world data sets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical Python wrapper for ten standard stationarity tests with grouped diagnostics and auto frequency detection, but it introduces no new methods or theory.

The punchline for this paper is that it offers a Python library called StationarityToolkit which combines ten existing statistical tests for time series stationarity into categories for trend, variance, and seasonality, complete with automatic frequency detection and diagnostic notes rather than simple pass/fail results.

What stands out as useful is the focus on real-world application. Time series data often has multiple forms of non-stationarity at once, and running separate tests manually can be tedious. By grouping them and providing interpretations with test statistics, p-values, and suggested actions, the toolkit supports the kind of iterative process that analysts actually follow when preparing data for models. The automatic inference of frequency from a datetime index is a nice touch for convenience.

On the softer side, the contribution is limited to packaging and workflow. No new tests are derived, no improvements to existing ones are proposed, and the abstract does not detail any novel handling for common problems like missing data or non-standard sampling. The reliability of the frequency detection across varied inputs remains an open question that would need checking in the code or examples. The full manuscript likely includes implementation details, but from what is described, the soundness depends on faithful reproduction of the standard tests.

This kind of work is for applied statisticians, data scientists, and forecasters who use Python and want a streamlined way to assess stationarity before proceeding with ARIMA or other methods. It won't change how experts think about the tests, but it could reduce boilerplate code and help less experienced users get better diagnostics.

Overall, I recommend putting it through peer review. A software paper like this can benefit from referees checking the code quality, test coverage, and whether the diagnostics are clear and accurate in practice.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces StationarityToolkit, a Python library that applies 10 established statistical tests for time-series stationarity, grouped into trend (4 tests), variance (4 tests), and seasonality (2 tests). It returns detailed per-test diagnostics with statistics, p-values, and actionable notes rather than binary verdicts, includes automatic frequency inference from datetime indices, and supports iterative test-transform-retest workflows.

Significance. If the implementation is correct and handles edge cases reliably, the toolkit could offer practical value to applied statisticians and forecasters by enabling more nuanced, multi-faceted stationarity diagnostics than single-test approaches commonly used in preprocessing pipelines.

major comments (1)

[Abstract] The central claim that the toolkit 'automatically infers the frequency of the data provided (requires datetime index)' and supports real-world iterative workflows is load-bearing, yet no description is given of the inference algorithm, its handling of missing values, irregular sampling, or non-standard datetime formats; without this, the reliability of the advertised functionality cannot be assessed.

minor comments (2)

The first sentence of the abstract is grammatically incomplete ('a property that statistical characteristics such as trend, variance, seasonality remain constant'); rephrasing for clarity would improve readability.
[Abstract] Explicitly naming the 10 tests and citing their original references (e.g., ADF, KPSS, etc.) would allow readers to evaluate coverage without inspecting the source code.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing StationarityToolkit. The point raised about insufficient documentation of the frequency inference mechanism is well-taken, and we address it directly below.

Referee: [Abstract] The central claim that the toolkit 'automatically infers the frequency of the data provided (requires datetime index)' and supports real-world iterative workflows is load-bearing, yet no description is given of the inference algorithm, its handling of missing values, irregular sampling, or non-standard datetime formats; without this, the reliability of the advertised functionality cannot be assessed.

Authors: We agree that the manuscript does not provide sufficient detail on the frequency inference procedure, which limits the ability to evaluate its robustness. The current implementation uses pandas.infer_freq as the core method, supplemented by custom logic to compute intervals from the datetime index after converting via pandas.to_datetime (with infer_datetime_format=True for format flexibility). For missing values, NaT entries are dropped before inference, and a warning is issued if they exceed 5% of observations; no imputation is performed automatically. Irregular sampling is detected by checking the standard deviation of consecutive time deltas against a tolerance threshold (default 1e-6 relative to the median delta), triggering a warning and fallback to a user-provided freq parameter if inconsistency is found. Non-standard formats are handled through pandas parsing, supporting ISO 8601, common regional variants, and explicit format strings. In the revised manuscript we will add a new subsection in the Implementation section with pseudocode, edge-case examples, and explicit discussion of these behaviors. This documentation will also clarify how the inferred frequency enables the iterative test-transform-retest workflow by ensuring consistent re-indexing after transformations such as differencing or deseasonalization. The package code already contains these safeguards, so the revision will consist of expanded description rather than new functionality.

Revision: yes
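The regularity check described in the rebuttal can be sketched with the standard library alone. This is an illustration of the stated rule (drop missing stamps, warn above 5%, compare the standard deviation of consecutive gaps with 1e-6 times the median gap), not the package's code and not a substitute for `pandas.infer_freq`. The function name `check_regular_sampling`, its parameters, and the use of `None` for a missing timestamp are assumptions.

```python
import warnings
from datetime import datetime, timedelta
from statistics import median, pstdev


def check_regular_sampling(timestamps, rel_tol=1e-6, max_missing_frac=0.05):
    """Return (median_step, is_regular) for datetimes in chronological order.

    None marks a missing timestamp (a stand-in for NaT); such entries are
    dropped, with a warning if they exceed max_missing_frac of the input.
    """
    present = [ts for ts in timestamps if ts is not None]
    if len(present) < 3:
        raise ValueError("need at least three non-missing timestamps")

    missing_frac = 1 - len(present) / len(timestamps)
    if missing_frac > max_missing_frac:
        warnings.warn(f"{missing_frac:.1%} of timestamps are missing")

    # Gaps between consecutive observations, in seconds.
    deltas = [(b - a).total_seconds() for a, b in zip(present, present[1:])]
    median_gap = median(deltas)

    # Regular if the spread of the gaps is negligible relative to the median gap.
    is_regular = pstdev(deltas) <= rel_tol * median_gap
    return timedelta(seconds=median_gap), is_regular


start = datetime(2024, 1, 1)
daily = [start + timedelta(days=i) for i in range(10)]
step, regular = check_regular_sampling(daily)

# Remove one observation to create a two-day gap.
gappy = daily[:5] + daily[6:]
gappy_step, gappy_regular = check_regular_sampling(gappy)

print(step, regular)
print(gappy_step, gappy_regular)
```

A fixed-gap rule of this kind would also mark calendar-month data as irregular, since months differ in length. That is one reason the referee's request for documented edge-case behavior matters.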

Circularity Check

0 steps flagged

No significant circularity; tool-description paper with no derivations

The paper presents StationarityToolkit as a Python library that aggregates 10 established statistical tests (4 trend, 4 variance, 2 seasonality) plus automatic frequency inference from datetime indices, returning diagnostics rather than new theoretical results. No equations, fitted parameters, or derivation chain appear in the provided text; the central claim is simply that the library implements and organizes these pre-existing procedures with actionable output. This matches the reader's assessment of zero circularity and satisfies the hard rule that circularity is only flagged when a specific reduction to inputs can be quoted. The work is self-contained as a software contribution, with no load-bearing self-citations and no self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is a software implementation of standard tests. No free parameters are fitted, no new axioms are introduced, and no invented entities are postulated.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Stationarity Transformations Actually Improve Time Series Forecasts? A Controlled Experimental Evaluation
stat.ME 2026-05 unverdicted novelty 7.0

Large-scale experiments on synthetic data find stationarity transformations improve forecasts in only 18% of matched cases, with variance stabilization as the main exception and signal attenuation as the mechanism.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper

[1] Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society A, 160(901), 268–282.

[2] Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B, 26(2), 211–252.

[3] Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–33.

[4] Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74(366a), 427–431.

[5] Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1007.

[6] Harris, C. R., Millman, K. J., Walt, S. J. van der, & others. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2

[7] Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 54(1-3), 159–178.

[8] Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics (pp. 278–292). Stanford University Press.

[9] McKinney, W. (2010). Data structures for statistical computing in python. Proceedings of the 9th Python in Science Conference, 445, 56–61. https://doi.org/10.25080/Majora-92bf1922-00a

[10] Phillips, P. C. B., & Perron, P. (1988). Testing for a unit root in time series regression. Biometrika, 75(2), 335–346.

[11] Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with python. 9th Python in Science Conference.

[12] Sheppard, K. (2017). Arch: ARCH models in python. Zenodo. https://doi.org/10.5281/zenodo.593254

[13] Smith, T. G. (2015). Pmdarima (Version 2.1.1). https://github.com/alkaline-ml/pmdarima

[14] Virtanen, P., Gommers, R., Oliphant, T. E., & others. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in python. Nature Methods, 17(3), 261–272. https://doi.org/10.1038/s41592-019-0686-2

[15] Zivot, E., & Andrews, D. W. K. (1992). Further evidence on the great crash, the oil-price shock, and the unit-root hypothesis. Journal of Business & Economic Statistics, 10(3), 251–270.
