Multiple-group (Controlled) Interrupted Time Series Analysis with Higher-Order Autoregressive Errors: A Simulation Study Comparing Newey-West and Prais-Winsten Methods

Ariel Linden

arxiv: 2603.24814 · v3 · pith:GY5PNVZMnew · submitted 2026-03-25 · 📊 stat.AP

Pith reviewed 2026-05-15 00:07 UTC · model grok-4.3

keywords: interrupted time series, autoregressive errors, Newey-West, Prais-Winsten, simulation, multiple group, type I error, coverage

The pith

Prais-Winsten regression provides valid inference under higher-order autoregressive errors in multiple-group interrupted time series analysis while OLS with Newey-West errors does not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper simulates multiple-group controlled interrupted time series data with AR2 and AR3 error structures to compare OLS with Newey-West standard errors against Prais-Winsten regression. Both methods yield approximately unbiased estimates of treatment effects on level and trend. Yet OLS-NW produces poor confidence interval coverage and elevated type I error rates that worsen with higher autocorrelation order and longer series, dropping to 45-50 percent coverage under highly persistent autocorrelation at 100 time points, while Prais-Winsten holds at 91-94 percent. Readers care because these methods are used to assess healthcare interventions, and invalid inference can lead to wrong decisions about program effectiveness. The findings indicate that the power-inference tradeoff intensifies beyond first-order autocorrelation.

Core claim

In Monte Carlo simulations of a multiple-group interrupted time series model with four control units, data were generated under AR2 and AR3 processes with mild, oscillatory, and highly persistent autocorrelation. Prais-Winsten regression maintains near-nominal 95 percent confidence interval coverage and type I error control for difference-in-differences estimates of level and trend changes. OLS with Newey-West standard errors shows substantial undercoverage and type I error inflation, especially with highly persistent errors and longer series. Both approaches remain approximately unbiased in their point estimates.

What carries the argument

Monte Carlo simulation of MG-ITSA model comparing OLS-NW and PW under AR2 and AR3 errors of varying persistence.
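
For orientation, the multiple-group ITSA regression is usually written in the following form in the ITSA literature. This is a sketch under that assumption: the symbols below are not taken from this page, and the paper's own notation may differ. Here T_t is time since the start of the series, X_t is the post-intervention indicator, Z is the treated-group indicator, and k is 2 or 3 for the error processes studied.

```latex
Y_t = \beta_0 + \beta_1 T_t + \beta_2 X_t + \beta_3 X_t T_t
      + \beta_4 Z + \beta_5 Z T_t + \beta_6 Z X_t + \beta_7 Z X_t T_t + \varepsilon_t,
\qquad
\varepsilon_t = \sum_{j=1}^{k} \rho_j \, \varepsilon_{t-j} + u_t
```

Under this form, \beta_6 and \beta_7 are the difference-in-differences effects on level and trend, which are the quantities whose coverage and type I error the simulations track.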

If this is right

Analysts should prefer Prais-Winsten regression for hypothesis testing in MG-ITSA to achieve valid error control when higher-order autocorrelation is possible.
The power advantage of OLS-NW is offset by its inflated false positive rate.
Coverage for OLS-NW declines further as the time series length increases under persistent autocorrelation.
These patterns persist across sensitivity analyses with alternative designs.
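
The intuition behind the Prais-Winsten recommendation is that it transforms the data so the regression errors become approximately uncorrelated. The sketch below shows only the quasi-differencing step for an AR(k) error, under two simplifying assumptions that are not from the paper: the coefficients `rho` are treated as known rather than estimated, and the first k observations are dropped (Cochrane-Orcutt style) instead of being rescaled as Prais-Winsten does. The function names are hypothetical.

```python
def ar_errors(innovations, rho):
    """Build AR(k) errors from given innovations, starting from an empty history."""
    errors = []
    for t, shock in enumerate(innovations):
        past = sum(
            rho[j] * errors[t - 1 - j]
            for j in range(len(rho))
            if t - 1 - j >= 0
        )
        errors.append(past + shock)
    return errors


def quasi_difference(series, rho):
    """Return series[t] - sum_j rho[j] * series[t-1-j] for t >= k (first k points dropped)."""
    k = len(rho)
    return [
        series[t] - sum(rho[j] * series[t - 1 - j] for j in range(k))
        for t in range(k, len(series))
    ]


# Illustrative values only: the "high persistent" AR(2) coefficients quoted in the figure captions.
rho = (0.7, 0.2)
innovations = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -0.5]

errors = ar_errors(innovations, rho)
whitened = quasi_difference(errors, rho)

# After quasi-differencing, the serially correlated errors reduce to the original shocks.
print(whitened)
print(innovations[len(rho):])
```

In a real analysis the same transformation is applied to the outcome and every regressor, and the coefficients are estimated iteratively from the residuals.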

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Published studies using OLS-NW on similar data may have reported more significant findings than warranted.
Data analysts could first test for the order of autocorrelation to select the appropriate method.
Extensions of Prais-Winsten to other complex time series designs in health research warrant investigation.
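
One common way to gauge autocorrelation order is the partial autocorrelation function (PACF), which for an AR(k) process is zero beyond lag k. The sketch below computes the PACF with the Durbin-Levinson recursion. To keep the example deterministic it is fed the theoretical autocorrelations of an AR(2) process; in practice sample autocorrelations of the regression residuals would be used instead, and those are noisy at short series lengths. The function names and the choice of coefficients are illustrative assumptions.

```python
def ar2_autocorrelations(phi1, phi2, max_lag):
    """Theoretical autocorrelations r[0..max_lag] of a stationary AR(2) process."""
    r = [1.0, phi1 / (1.0 - phi2)]
    for lag in range(2, max_lag + 1):
        r.append(phi1 * r[lag - 1] + phi2 * r[lag - 2])
    return r[: max_lag + 1]


def pacf_from_acf(r):
    """Durbin-Levinson recursion: partial autocorrelations for lags 0..len(r)-1."""
    pacf = [1.0]
    prev = []  # coefficients of the AR(lag-1) fit, lag 1 first
    for lag in range(1, len(r)):
        num = r[lag] - sum(prev[j] * r[lag - 1 - j] for j in range(lag - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(lag - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[lag - 2 - j] for j in range(lag - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


acf = ar2_autocorrelations(0.7, 0.2, max_lag=4)
pacf = pacf_from_acf(acf)

# The PACF equals the last AR coefficient at lag 2 and cuts off afterwards.
for lag, value in enumerate(pacf):
    print(f"lag {lag}: partial autocorrelation = {value:.4f}")
```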

Load-bearing premise

The specific parameterizations of mild, oscillatory, and highly persistent AR2 and AR3 processes in the simulations represent the autocorrelation structures typical in real multiple-group healthcare time series data.
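
A rough way to compare how demanding these scenarios are is the long-run variance of the error process relative to its innovation variance, which for a stationary AR(k) process is 1 / (1 - sum of coefficients)^2. The sketch below applies that formula to the ρ values quoted in the figure captions on this page, assuming they are AR coefficients on lags 1 through k and that each process is stationary. The scenario labels and function name are illustrative.

```python
# Coefficient tuples as quoted in the figure captions (assumed to be AR coefficients).
SCENARIOS = {
    "AR(2) mild positive": (0.4, 0.2),
    "AR(2) oscillatory": (0.5, -0.4),
    "AR(2) high persistent": (0.7, 0.2),
    "AR(3) mild positive": (0.4, 0.2, 0.1),
    "AR(3) oscillatory": (0.7, -0.3, 0.15),
    "AR(3) high persistent": (0.6, 0.25, 0.1),
}


def long_run_variance_factor(rho):
    """Long-run variance divided by innovation variance for a stationary AR(k) process."""
    return 1.0 / (1.0 - sum(rho)) ** 2


factors = {name: long_run_variance_factor(rho) for name, rho in SCENARIOS.items()}

for name, factor in factors.items():
    print(f"{name}: coefficient sum = {sum(SCENARIOS[name]):.2f}, factor = {factor:.2f}")
```

The high persistent cases have coefficient sums of 0.90 and 0.95, so their long-run variance is far larger than in the mild or oscillatory cases. A variance estimator that truncates at a few lags has much more to miss there, which is consistent with the coverage pattern described above.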

What would settle it

A new simulation using different AR parameter values or an empirical analysis of real data with known higher-order autocorrelation where OLS-NW achieves coverage rates close to 95 percent would falsify the finding.

Figures

Figures reproduced from arXiv: 2603.24814 by Ariel Linden.

**Figure 1.** Comparison of Newey-West and Prais-Winsten methods on power for AR[2] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2); oscillatory: ρ = (0.5, −0.4); high persistent positive: ρ = (0.7, 0.2)).

**Figure 2.** Comparison of Newey-West and Prais-Winsten methods on 95% coverage for AR[2] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2); oscillatory: ρ = (0.5, −0.4); high persistent positive: ρ = (0.7, 0.2)).

**Figure 3.** Comparison of Newey-West and Prais-Winsten methods on Type I error rates for AR[2] error structures for difference-in-differences in trend. Columns represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2); oscillatory: ρ = (0.5, −0.4); high persistent positive: ρ = (0.7, 0.2)).

**Figure 4.** Comparison of Newey-West and Prais-Winsten methods on percent bias for AR[2] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2); oscillatory: ρ = (0.5, −0.4); high persistent positive: ρ = (0.7, 0.2)).

**Figure 5.** Comparison of Newey-West and Prais-Winsten methods on root mean squared error (RMSE) for AR[2] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2); oscillatory: ρ = (0.5, −0.4); high persistent positive: ρ = (0.7, 0.2)).

**Figure 6.** Comparison of Newey-West and Prais-Winsten methods on empirical standard errors for AR[2] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2); oscillatory: ρ = (0.5, −0.4); high persistent positive: ρ = (0.7, 0.2)).

**Figure 7.** Comparison of Newey-West and Prais-Winsten methods on power for AR[3] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2, 0.1); oscillatory: ρ = (0.7, −0.3, 0.15); high persistent positive: ρ = (0.6, 0.25, 0.1)).

**Figure 8.** Comparison of Newey-West and Prais-Winsten methods on 95% coverage for AR[3] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2, 0.1); oscillatory: ρ = (0.7, −0.3, 0.15); high persistent positive: ρ = (0.6, 0.25, 0.1)).

**Figure 9.** Comparison of Newey-West and Prais-Winsten methods on Type I error rates for AR[3] error structures for difference-in-differences in trend. Columns represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2, 0.1); oscillatory: ρ = (0.7, −0.3, 0.15); high persistent positive: ρ = (0.6, 0.25, 0.1)).

**Figure 10.** Comparison of Newey-West and Prais-Winsten methods on percent bias for AR[3] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2, 0.1); oscillatory: ρ = (0.7, −0.3, 0.15); high persistent positive: ρ = (0.6, 0.25, 0.1)).

**Figure 11.** Comparison of Newey-West and Prais-Winsten methods on root mean squared error (RMSE) for AR[3] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2, 0.1); oscillatory: ρ = (0.7, −0.3, 0.15); high persistent positive: ρ = (0.6, 0.25, 0.1)).

**Figure 12.** Comparison of Newey-West and Prais-Winsten methods on empirical standard errors for AR[3] error structures for difference-in-differences in trend. Columns represent different effect sizes (25%, 50%, and 100%); rows represent different autocorrelation scenarios (mild positive: ρ = (0.4, 0.2, 0.1); oscillatory: ρ = (0.7, −0.3, 0.15); high persistent positive: ρ = (0.6, 0.25, 0.1)).

Original abstract

Previous comparisons of ordinary least squares with Newey-West standard errors (OLS-NW) and Prais-Winsten (PW) regression in multiple-group interrupted time series analysis have been limited to first-order autoregressive (AR[1]) errors because PW estimation for higher-order AR[k] processes was previously unavailable. We conducted the first systematic evaluation of OLS-NW and PW under AR[2] and AR[3] error structures using Monte Carlo simulation. Simulations examined mild positive, oscillatory, and high persistent autocorrelation across varying series lengths and effect sizes. OLS-NW generally showed higher apparent power but substantially inflated Type I error and poor coverage, particularly under persistent autocorrelation, where inferential performance worsened with increasing AR order and series length. PW maintained substantially better inferential calibration across nearly all conditions. Both methods were approximately unbiased.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PW keeps coverage and type I error in check under AR(2)/AR(3) errors in these MG-ITSA simulations while OLS-NW coverage drops sharply in persistent cases.

The main takeaway is that Prais-Winsten regression maintains reasonable coverage and type I error control under AR(2) and AR(3) errors in multiple-group interrupted time series setups, while OLS with Newey-West standard errors sees coverage fall to 45-50% and type I error rise to 50-57% in the highly persistent cases at T=100. The power advantage of OLS-NW turns out to be mostly inflated false positives. This extends earlier AR(1) comparisons in a useful way.

The Monte Carlo design generates data with four control units across T=10 to 100, three autocorrelation patterns (mild, oscillatory, highly persistent), and both level and trend effects. It tracks the usual metrics (power, coverage, type I error, bias, RMSE, and empirical standard errors) and shows both estimators stay roughly unbiased. That is a clean result worth having on record. The numbers make the worsening tradeoff clear as AR order increases and series lengthen.

The soft spot is the lack of any calibration of those AR coefficients against actual autocorrelation patterns in real multiple-group healthcare time series. Without that, it is hard to judge how often the worst-case persistent scenarios actually occur in the settings where MG-ITSA gets used. The sensitivity analyses stay inside the same simulated structures, so they do not close the gap.

This paper is for applied researchers who run controlled interrupted time series for intervention evaluation in health services or policy work. It gives concrete guidance on choosing between two common estimators when higher-order autocorrelation is possible. I would send it for peer review. The simulation evidence is straightforward and fills a gap left by the AR(1)-only literature, even if referees will want more discussion of how representative the chosen error structures are.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a Monte Carlo simulation study comparing ordinary least squares with Newey-West standard errors (OLS-NW) and Prais-Winsten (PW) regression for multiple-group controlled interrupted time series analysis (MG-ITSA) under AR(2) and AR(3) error structures. Using four control units, series lengths T=10–100, and autocorrelation parameterized as mild, oscillatory, and highly persistent, the study evaluates bias, coverage, power, type I error, RMSE, and related metrics for difference-in-differences level and trend effects. Both estimators are approximately unbiased, but OLS-NW exhibits inflated type I error and coverage as low as 45–50% under high persistence at T=100, while PW maintains 91–94% coverage; the authors conclude that the power-validity tradeoff worsens with higher-order autocorrelation and recommend PW for reliable inference.

Significance. If the chosen AR(2)/AR(3) parameterizations are representative of autocorrelation in real MG-ITSA healthcare series, the results would provide actionable guidance on estimator choice, demonstrating that apparent power gains from OLS-NW come at the cost of invalid inference and supporting PW for error control in applied work. The systematic Monte Carlo design across multiple conditions and sensitivity analyses is a strength.

major comments (2)

[Methods] Methods (data generation and AR parameterization): The central recommendation that PW is preferred for hypothesis testing in MG-ITSA rests on the claim that the simulated mild/oscillatory/highly persistent AR(2)/AR(3) structures are representative of autocorrelation encountered in actual multiple-group healthcare time series. No calibration, comparison to empirical autocovariance patterns from real datasets, or sensitivity to alternative coefficient choices is reported, so the observed coverage collapse for OLS-NW (45–50% at T=100 under high persistence) and the worsening tradeoff may not generalize to the settings where the method is applied.
[Results] Results (coverage and type I error reporting): The reported OLS-NW coverage of 45–50% and type I error of 50–57% under highly persistent autocorrelation at T=100 are load-bearing for the preference claim, yet it is unclear whether these figures are for a single worst-case cell or averaged; explicit tabulation by T, effect size, and AR order (e.g., in the main results table) is needed to assess how quickly performance degrades.

minor comments (2)

[Abstract] Abstract: The statement that 'PW showed power advantages in some settings' is not quantified; adding the specific conditions (T, AR order, effect size) where this occurs would improve precision.
[Methods] The manuscript mentions sensitivity analyses for alternative designs but provides no details on what those designs were; a short description or reference to a supplementary table would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our simulation study. We address each major point below and have revised the manuscript accordingly to improve clarity and robustness.

Referee: [Methods] Methods (data generation and AR parameterization): The central recommendation that PW is preferred for hypothesis testing in MG-ITSA rests on the claim that the simulated mild/oscillatory/highly persistent AR(2)/AR(3) structures are representative of autocorrelation encountered in actual multiple-group healthcare time series. No calibration, comparison to empirical autocovariance patterns from real datasets, or sensitivity to alternative coefficient choices is reported, so the observed coverage collapse for OLS-NW (45–50% at T=100 under high persistence) and the worsening tradeoff may not generalize to the settings where the method is applied.

Authors: We agree that direct calibration to empirical autocovariance from real MG-ITSA healthcare series would strengthen generalizability. The AR(2)/AR(3) coefficients were chosen to span a range of patterns (mild, oscillatory, highly persistent) drawn from common time-series literature on health data, but we did not perform explicit matching to specific datasets. In revision we have added a sensitivity analysis varying the AR coefficients across plausible ranges around our base cases and included a new discussion subsection relating these choices to observed autocorrelation in published MG-ITSA applications. We believe this addresses the concern without altering the core simulation design. (Revision: yes)
Referee: [Results] Results (coverage and type I error reporting): The reported OLS-NW coverage of 45–50% and type I error of 50–57% under highly persistent autocorrelation at T=100 are load-bearing for the preference claim, yet it is unclear whether these figures are for a single worst-case cell or averaged; explicit tabulation by T, effect size, and AR order (e.g., in the main results table) is needed to assess how quickly performance degrades.

Authors: The 45–50% coverage and 50–57% type I error figures refer to the specific highly persistent AR(3) condition at T=100 (not averages across cells). We have revised the results section and added a supplementary table that reports coverage, type I error, power, and RMSE broken down by T, AR order, persistence level, and effect size. This makes the degradation pattern transparent and allows readers to evaluate performance across the full design grid. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: results derive from independent Monte Carlo simulations

The paper reports a standard Monte Carlo simulation study that generates data under explicitly specified AR(2) and AR(3) processes (mild, oscillatory, highly persistent) and then evaluates finite-sample performance of OLS-NW versus PW estimators on those generated series. No parameter is fitted to the target quantities being reported, no result is renamed as a prediction, and no load-bearing premise reduces to a self-citation or self-definition. All metrics (coverage, power, type I error, bias, RMSE) are computed directly from the simulated replicates, rendering the comparison self-contained against external benchmarks.
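
To make concrete how coverage and type I error are tallied from simulated replicates, here is a deliberately reduced sketch. It is not the paper's MG-ITSA design: it estimates only the mean of a zero-mean AR(2) series, uses a Newey-West (Bartlett kernel) standard error with a fixed truncation of 4 lags, and relies on a normal critical value. The function names, replicate count, lag choice, and seed are all assumptions made for illustration. Because the true mean is zero, the type I error rate is one minus the coverage.

```python
import math
import random


def simulate_ar(phi, n, rng, burn_in=200):
    """Draw n observations from a zero-mean AR(p) process with N(0, 1) innovations."""
    history = [0.0] * len(phi)  # most recent value first
    out = []
    for step in range(burn_in + n):
        value = sum(c * h for c, h in zip(phi, history)) + rng.gauss(0.0, 1.0)
        history = [value] + history[:-1]
        if step >= burn_in:
            out.append(value)
    return out


def newey_west_se_of_mean(y, max_lag):
    """Newey-West (Bartlett kernel) standard error of the sample mean."""
    n = len(y)
    mean = sum(y) / n
    resid = [v - mean for v in y]

    def gamma(lag):
        return sum(resid[t] * resid[t - lag] for t in range(lag, n)) / n

    lrv = gamma(0)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1.0)
        lrv += 2.0 * weight * gamma(lag)
    return math.sqrt(max(lrv, 0.0) / n)


def coverage(phi, n=100, reps=400, max_lag=4, seed=12345):
    """Share of replicates whose 95% interval contains the true mean of zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        y = simulate_ar(phi, n, rng)
        estimate = sum(y) / n
        se = newey_west_se_of_mean(y, max_lag)
        if abs(estimate) <= 1.96 * se:
            hits += 1
    return hits / reps


cov_mild = coverage((0.4, 0.2))
cov_persistent = coverage((0.7, 0.2))
print(f"mild AR(2): coverage = {cov_mild:.3f}, type I error = {1 - cov_mild:.3f}")
print(f"persistent AR(2): coverage = {cov_persistent:.3f}, type I error = {1 - cov_persistent:.3f}")
```

The same bookkeeping extends to the full study by replacing the mean with the regression coefficient of interest and repeating over each cell of the design grid.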

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The findings depend on the validity of the simulated autocorrelation structures matching real data and the choice of simulation parameters like series lengths and effect sizes.

free parameters (1)

AR coefficients for mild, oscillatory, and persistent cases
Chosen to represent different autocorrelation patterns in the data generation process.

axioms (1)

domain assumption: The error terms follow AR(2) or AR(3) processes with specified parameters.
This defines the data generating mechanism for the Monte Carlo study.
