arxiv: 2303.04754 · v5 · submitted 2023-03-08 · 📊 stat.ME · stat.CO

Estimation of Long-Range Dependent Models with Missing Data: to Impute or not to Impute?

Pith reviewed 2026-05-24 09:38 UTC · model grok-4.3

classification 📊 stat.ME stat.CO
keywords long-range dependence · ARFIMA models · missing data · imputation · Monte Carlo simulation · parameter estimation · time series analysis

The pith

A Monte Carlo study of 35 setups compares imputation to direct methods for estimating the long-memory parameter d in ARFIMA models with missing data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to clarify whether filling in missing observations before estimation or using methods built for incomplete series produces better estimates of the fractional parameter d in ARFIMA time series. It reviews available techniques for each route and tests them across simulated series that contain 10 to 70 percent missing values and different strengths of long-range dependence. The comparison is intended to show practitioners which route yields more reliable results under realistic conditions of incompleteness. Readers care because many observed series in economics, hydrology, and other fields contain gaps yet still require accurate measurement of persistence for prediction and inference.

Core claim

The paper conducts a Monte Carlo simulation study that compares 35 different setups for estimating d under numerous scenarios with 10% to 70% missing data and several levels of dependence, contrasting imputation-based estimation with methods tailored for missing observations in ARFIMA(p,d,q) models.
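
For orientation, the ARFIMA$(p,d,q)$ model can be written in its standard textbook form. This is a reference sketch, not quoted from the paper: the symbols $\phi$, $\theta$, $B$ (backshift operator), and $\varepsilon_t$ (white noise) are notational assumptions.

```latex
\phi(B)\,(1 - B)^{d} X_t = \theta(B)\,\varepsilon_t,
\qquad
(1 - B)^{d} = \sum_{k=0}^{\infty} \binom{d}{k} (-B)^{k},
\qquad
\gamma(h) \sim C\, h^{2d - 1} \quad \text{as } h \to \infty \ \ (0 < d < 1/2).
```

Here $\phi$ and $\theta$ are the AR and MA polynomials of orders $p$ and $q$, and $\gamma(h)$ is the autocovariance at lag $h$. The slow hyperbolic decay of $\gamma(h)$ for $0 < d < 1/2$ is what "long-range dependence" refers to, and $d$ is the parameter whose estimation the study compares.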

What carries the argument

Monte Carlo simulation comparing 35 estimation setups for the long-memory parameter d with missing data

If this is right

  • Imputation before estimation remains competitive with specialized missing-data estimators across a wide range of missing percentages.
  • Performance differences between the two approaches depend on both the fraction of missing observations and the strength of dependence.
  • Review of available methods shows practical options exist under both the imputation route and the direct-missing-data route.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The simulation results could let analysts avoid unnecessarily complex missing-data algorithms when simpler imputation suffices.
  • Similar large-scale comparisons could be repeated for other long-memory models or for non-Gaussian series.
  • The study framework supplies a template that later work can apply to real empirical series whose true d is known from complete subsamples.

Load-bearing premise

The Monte Carlo simulation designs and missing-data mechanisms used are representative of real-world performance for ARFIMA estimation with missing observations.

What would settle it

Take a long series with known d, delete observations according to the mechanisms studied, apply the simulation-identified best method, and check whether it recovers the true d more accurately than the alternatives.
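
The experiment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the series is an ARFIMA(0, d, 0) approximated by a truncated MA(∞) expansion (`trunc` lags), deletion is completely at random, the two imputations are mean fill and linear interpolation, and the estimator is a basic GPH log-periodogram regression with bandwidth `n ** 0.5`. All function names and defaults are hypothetical.

```python
import numpy as np


def simulate_arfima_0d0(n, d, rng, trunc=2000):
    """Approximate ARFIMA(0, d, 0) by truncating its MA(infinity) form."""
    # Coefficients of (1 - B)^(-d): psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k
    k = np.arange(1, trunc)
    psi = np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))
    eps = rng.standard_normal(n + trunc - 1)
    # "valid" keeps only outputs that use all `trunc` weights, giving n values
    return np.convolve(eps, psi, mode="valid")


def delete_mcar(x, frac_missing, rng):
    """Set each value to NaN independently with probability frac_missing."""
    out = x.copy()
    out[rng.random(len(x)) < frac_missing] = np.nan
    return out


def impute(x_obs, method):
    """Fill NaNs by the sample mean or by linear interpolation."""
    x = x_obs.copy()
    miss = np.isnan(x)
    idx = np.arange(len(x))
    if method == "mean":
        x[miss] = np.nanmean(x)
    elif method == "linear":
        # np.interp holds the nearest observed value at the series ends
        x[miss] = np.interp(idx[miss], idx[~miss], x[~miss])
    else:
        raise ValueError(f"unknown method: {method}")
    return x


def gph(x, power=0.5):
    """GPH estimate of d from the first m = n**power Fourier frequencies."""
    n = len(x)
    m = int(n ** power)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    pgram = np.abs(np.fft.fft(x - x.mean())[1:m + 1]) ** 2 / (2 * np.pi * n)
    regressor = np.log(4 * np.sin(lam / 2) ** 2)
    slope = np.polyfit(regressor, np.log(pgram), 1)[0]
    return -slope  # log I(lambda) is roughly c - d * regressor


def run_experiment(d=0.4, n=2000, frac_missing=0.3, reps=50, seed=0):
    """Average GPH estimates on complete and imputed series."""
    rng = np.random.default_rng(seed)
    est = {"complete": [], "mean": [], "linear": []}
    for _ in range(reps):
        x = simulate_arfima_0d0(n, d, rng)
        x_obs = delete_mcar(x, frac_missing, rng)
        est["complete"].append(gph(x))
        est["mean"].append(gph(impute(x_obs, "mean")))
        est["linear"].append(gph(impute(x_obs, "linear")))
    return {name: float(np.mean(vals)) for name, vals in est.items()}


if __name__ == "__main__":
    for name, avg in run_experiment().items():
        print(f"{name:>8}: mean d-hat = {avg:.3f}")
```

Comparing the three averages with the true `d` gives a rough picture of how much bias each imputation introduces at a given missing fraction. The paper's study covers many more estimators and settings than this sketch.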

Figures

Figures reproduced from arXiv: 2303.04754 by Gladys Choque Ulloa, Guilherme Pumi, Taiane Schaedler Prass.

Figure 1: Box plot of the fitted ARFIMA(0, 0.1, 0) model. When data imputation is in place, the variability of the estimators does not seem to be significantly impacted by the percentage of missing data. The bias, however, is very much so. The DFA, ELW, and GPH are the methods presenting the highest overall variability. It is also noteworthy that, for 10% missing data, the imputation method applied makes little diffe…
Figure 2: Box plot of the fitted ARFIMA(0, 0.4, 0) model.
3.5 Time benchmarking. In this section, we compare the computational speed of each estimator considered in the simulations. Besides which estimator is the fastest to compute, there are a few other questions regarding computational speed that are of interest. For instance, does doubling the length of the time series double the time required to estimate d? Is th…
Figure 1: Box plot of the adjusted model ARFIMA (0 …
Figure 2: Box plot of the adjusted model ARFIMA (0 …
Figure 3: Box plot of the adjusted model ARFIMA (1 …
Figure 4: Box plot of the adjusted model ARFIMA (1 …
Figure 5: Box plot of the adjusted model ARFIMA (1 …
Figure 6: Box plot of the adjusted model ARFIMA (1 …

Original abstract

Among the most important models for long-range dependent time series is the class of ARFIMA$(p,d,q)$ (Autoregressive Fractionally Integrated Moving Average) models. Estimating the long-range dependence parameter $d$ in ARFIMA models is a well-studied problem, but the literature regarding the estimation of $d$ in the presence of missing data is very sparse. There are two basic approaches to dealing with the problem: missing data can be imputed using some plausible method, and then the estimation can proceed as if no data were missing, or we can use a specially tailored methodology to estimate $d$ in the presence of missing data. In this work, we review some of the methods available for both approaches and compare them through a Monte Carlo simulation study. We present a comparison among 35 different setups to estimate $d$, under tens of different scenarios, considering percentages of missing data ranging from as little as 10\% up to 70\% and several levels of dependence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper reviews imputation-based and specialized likelihood methods for estimating the long-memory parameter d in ARFIMA(p,d,q) models with missing observations. It conducts a Monte Carlo study comparing 35 estimation setups across missing-data fractions from 10% to 70% and multiple dependence strengths, with the goal of determining whether imputation or tailored methods perform better under these conditions.

Significance. The literature on ARFIMA estimation with missing data is sparse, so a systematic comparison of imputation versus direct methods could offer practical guidance. The value of the comparison hinges on whether the simulated missingness mechanisms are representative of realistic dependence structures in long-memory series; if they are, the results would help practitioners choose between the two approaches.

major comments (2)
  1. [Simulation study] Simulation study section: the manuscript does not specify whether missingness is generated under MCAR, MAR, or MNAR, nor whether the observation indicator is allowed to depend on lagged values of the ARFIMA process. In long-memory data this dependence can materially alter the bias of imputed estimators relative to exact or approximate likelihood methods that account for the missing pattern; without this information the reported performance rankings cannot be generalized beyond the particular simulation design.
  2. [Simulation study] Simulation study section: the 35 setups are described only at a high level in the abstract; the precise combination of imputation techniques, likelihood approximations, and software implementations used for each setup is not enumerated, making it impossible to reproduce or assess whether the comparison is exhaustive or contains redundant variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our Monte Carlo study. We address each major comment below and will revise the manuscript to improve clarity and reproducibility.

  1. Referee: [Simulation study] Simulation study section: the manuscript does not specify whether missingness is generated under MCAR, MAR, or MNAR, nor whether the observation indicator is allowed to depend on lagged values of the ARFIMA process. In long-memory data this dependence can materially alter the bias of imputed estimators relative to exact or approximate likelihood methods that account for the missing pattern; without this information the reported performance rankings cannot be generalized beyond the particular simulation design.

    Authors: We agree that the missingness mechanism must be clearly stated for proper interpretation. Our simulations used an MCAR mechanism in which the observation indicator is generated independently of the ARFIMA process values and their lags. We will add an explicit description of the data-generation process, including the MCAR assumption and independence from lagged values, to the Simulation study section. revision: yes

  2. Referee: [Simulation study] Simulation study section: the 35 setups are described only at a high level in the abstract; the precise combination of imputation techniques, likelihood approximations, and software implementations used for each setup is not enumerated, making it impossible to reproduce or assess whether the comparison is exhaustive or contains redundant variants.

    Authors: We accept that a high-level description limits reproducibility. The revised manuscript will contain a detailed table enumerating all 35 setups, specifying the exact imputation technique, likelihood method or approximation, software package and version, and any tuning parameters for each combination. revision: yes
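
The MCAR mechanism named in the first response can be illustrated with a short sketch. It assumes each observation indicator is an independent Bernoulli draw that never looks at the series values. Whether the study fixes the exact number of missing points or draws them this way is not stated here, so this choice is an assumption, as are the function names.

```python
import numpy as np


def mcar_mask(n, frac_missing, rng):
    """Boolean array, True where the value is observed.

    The draw uses only the random generator, never the series itself,
    which is what makes the mechanism missing completely at random.
    """
    return rng.random(n) >= frac_missing


def apply_mask(x, observed):
    """Return a float copy of x with unobserved entries set to NaN."""
    out = np.asarray(x, dtype=float).copy()
    out[~observed] = np.nan
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000)  # stand-in series, not an ARFIMA draw
    observed = mcar_mask(len(x), 0.3, rng)
    x_obs = apply_mask(x, observed)
    print("fraction missing:", np.isnan(x_obs).mean())
    print("corr(indicator, value):", np.corrcoef(observed, x)[0, 1])
```

Under MCAR the realized missing fraction fluctuates around the target and the correlation between the indicator and the series values should be close to zero. A MAR or MNAR design would instead let the indicator depend on lagged or current values.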

Circularity Check

0 steps flagged

No circularity: empirical Monte Carlo comparison with no derivation chain

full rationale

This is a pure simulation study that generates performance metrics for 35 estimation setups across missing-data percentages and dependence levels. No mathematical derivation, parameter fitting to target quantities, or self-citation load-bearing premise is present. The central output (relative performance rankings) is produced by running the estimators on simulated series; it does not reduce to any input by construction. External benchmarks (real data or alternative missingness mechanisms) are not required for the internal consistency of the reported Monte Carlo results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper is a comparative simulation study of existing statistical procedures for ARFIMA estimation.

pith-pipeline@v0.9.0 · 5714 in / 945 out tokens · 26974 ms · 2026-05-24T09:38:04.466171+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    … $< \infty$. Assume that the copula related to $(X_0, X_k)$ is $C_{\theta_k}$, where $\{\theta_k\}_{k \in \mathbb{N}^*}$ is a sequence in $D$ satisfying $\lim_{n \to \infty} \theta_n = a$. Assume further that $\mathrm{Cov}(X_0, X_n) \sim R(n, \eta)$, where $R(n, \eta)$ is a given continuous function such that $R(n, \eta) \to 0$ as $n$ goes to infinity, and $\eta \in S \subseteq \mathbb{R}^p$ is some (identifiable) parameter of interest. Also assume that $\theta_n - a \sim L(n, \eta)$, where...

  2. [2]

    Given $\{C_\theta\}_{\theta \in \Theta}$ as discussed above, choose a method to perform parameter estimation in the family

  3. [3]

    Choose estimators $\hat{F}_n$, $\hat{F}_n^{-1}$ and $\hat{F}_n'$ of the underlying unknown distribution $F$, quantile function $F^{-1}$ and density function $F'$, respectively. Estimate $\hat{K}_1$ and $\hat{K}_2$ by plugging these estimators into (A.13) and (A.14), respectively. We must have $0 < \hat{K}_1 < \infty$ and $\hat{K}_2 < \infty$.

  4. [4]

    Set $y_i := \hat{F}_n(x_i)$, for $i = 1, \dots, n$

  5. [5]

    Let $s$ and $m$ be two integers satisfying $1 < s < m < n$. For each $\ell \in \{s, \dots, m\}$, form a sequence $\{u_k^{(\ell)}\}_{k=1}^{n-\ell}$ by setting $u_i^{(\ell)} := (y_i, y_{i+\ell}) \in [0, 1]^2$, $i = 1, \dots, n - \ell$. From these pseudo-observations, estimate the copula parameter $\theta_\ell$, denoted by $\hat{\theta}_\ell(n)$

  6. [6]

    Let $D : \mathbb{R}^{m-s+1} \times \mathbb{R}^{m-s+1} \to [0, \infty)$ be a given function measuring the distance between two vectors in $\mathbb{R}^{m-s+1}$. Let $\widehat{L}_{s,m}(n) := \hat{K}_1 (\hat{\theta}_s(n) - a, \dots, \hat{\theta}_m(n) - a)'$ and $R_{s,m}(\eta) := (R(s, \eta), \dots, R(m, \eta))'$. The estimator $\hat{\eta}_{s,m}(n)$ of $\eta$ is then defined as $\hat{\eta}_{s,m}(n) := \operatorname{argmin}_{\eta \in S} D(\widehat{L}_{s,m}(n), R_{s,m}(\eta))$ (A.15). In practice, the estimation procedure in 1 and the esti...

  7. [7]

    Figure 1: Box plot of the adjusted ... (panels: native, mean, linear, random; axis label: Estimator; estimators: C.abry, C.full, DFA, ELW, GPH, LoMPE, LW, PP.F, PP.G, RS; axis ticks from −0.6 to 0.6)

  8. [8]

    Figure 2: Box plot of the adjusted ... (panels: native, mean, linear, random; axis label: Estimator; estimators: C.abry, C.full, DFA, ELW, GPH, LoMPE, LW, PP.F, PP.G, RS; axis ticks from −0.6 to 0.6)

  9. [9]

    Figure 3: Box plot of the adjusted ... (panels: native, mean, linear, random; axis label: Estimator; estimators: C.abry, C.full, DFA, ELW, GPH, LoMPE, LW, PP.F, PP.G, RS; axis ticks from −0.6 to 0.6)

  10. [10]

    Figure 4: Box plot of the adjusted ... (panels: native, mean, linear, random; axis label: Estimator; estimators: C.abry, C.full, DFA, ELW, GPH, LoMPE, LW, PP.F, PP.G, RS; axis ticks from −0.6 to 0.6)

  11. [11]

    Figure 5: Box plot of the adjusted ... (panels: native, mean, linear, random; axis label: Estimator; estimators: C.abry, C.full, DFA, ELW, GPH, LoMPE, LW, PP.F, PP.G, RS; axis ticks from −0.6 to 0.6)

  12. [12]

    Figure 6: Box plot of the adjusted ... (panels: native, mean, linear, random; axis label: Estimator; estimators: C.abry, C.full, DFA, ELW, GPH, LoMPE, LW, PP.F, PP.G, RS; axis ticks from −0.6 to 0.6)
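
The pseudo-observation and lagged-pair construction in extracted items [4] and [5] above can be sketched as follows. This is an illustrative reading only: it takes the estimator of F to be the plain empirical distribution function, so that each pseudo-observation is rank / n, which is an assumption; the function names are hypothetical, and the copula-parameter estimation step itself is not shown.

```python
import numpy as np


def pseudo_observations(x):
    """y_i = F_n(x_i), with F_n the empirical CDF, i.e. rank(x_i) / n."""
    x = np.asarray(x, dtype=float)
    # Double argsort gives 0-based ranks; ties are broken by position
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / len(x)


def lagged_pairs(y, lag):
    """Rows (y_i, y_{i+lag}) for i = 1, ..., n - lag, shape (n - lag, 2)."""
    y = np.asarray(y, dtype=float)
    return np.column_stack((y[:-lag], y[lag:]))


if __name__ == "__main__":
    x = [0.3, -1.0, 2.5, 0.0]
    y = pseudo_observations(x)
    print(y)                   # ranks 3, 1, 4, 2 divided by n = 4
    print(lagged_pairs(y, 1))  # three pairs in the unit square
```

With the empirical CDF the largest observation maps to exactly 1, which some copula likelihoods cannot evaluate. Rescaling by n + 1 instead of n is a common workaround, but whether the paper does so is not stated in the extracted text.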