Exponentially weighted estimands and the exponential family: Filtering, prediction and smoothing

Neil Shephard; Simon Donker van Heel

arxiv: 2512.16745 · v3 · submitted 2025-12-18 · 📊 stat.ME · econ.EM

Pith reviewed 2026-05-16 21:10 UTC · model grok-4.3

keywords: exponential family · filtering · smoothing · prediction · time series · exponentially weighted · maximum likelihood · recursive estimation

The pith

Maximizing a discounted convex combination of the log-likelihood and its expected value produces exact linear filters, predictors and smoothers for the canonical exponential family.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces exponentially weighted estimands formed by maximizing a discounted mix of the observed log-likelihood and the expected log-likelihood. For time series belonging to the canonical exponential family this maximization yields exact filters, predictors and smoothers whose updates obey simple linear recursions. The approach supplies a complete theory for these recursions and demonstrates them on both simulated series and real data. A reader would care because the resulting procedures are computationally cheap and exact, and they avoid the approximations typical of sequential estimation for widely used statistical models.

Core claim

By replacing ordinary maximum-likelihood estimation with the maximization of a discounted convex combination of the log-likelihood and the corresponding expected log-likelihood, one obtains filters, predictors and smoothers for exponential-family time series. In the canonical case these objects satisfy exact linear recursions whose coefficients are simple functions of the natural parameter and the discount factor.

What carries the argument

The exponentially weighted estimand: the argmax of a discounted convex combination of the log-likelihood with its expectation under the current parameter.
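
One way to write such an estimand down, as a sketch rather than the paper's exact definition: take a one-parameter canonical family with density $f(y \mid \theta) = h(y)\exp\{\theta y - \psi(\theta)\}$, a discount factor $\lambda \in (0,1)$, a mixing weight $\alpha \in [0,1]$ on the observed log-likelihood, and a reference parameter $\bar\theta$ under which the expectation is taken. The weight convention for $\alpha$ and the symbol $\bar\theta$ are assumptions introduced here for illustration only.

```latex
\begin{aligned}
Q_t(\theta)
  &= \sum_{s=1}^{t} \lambda^{t-s}
     \Big[ \alpha \log f(y_s \mid \theta)
         + (1-\alpha)\, \mathbb{E}_{\bar\theta}\{\log f(Y \mid \theta)\} \Big] \\
  &= \sum_{s=1}^{t} \lambda^{t-s}
     \Big[ \theta \{ \alpha y_s + (1-\alpha)\, \psi'(\bar\theta) \} - \psi(\theta) \Big]
     + \text{const}, \\
Q_t'(\hat\theta_t) = 0
  \;&\Longrightarrow\;
  \psi'(\hat\theta_t)
  = \alpha\, \frac{\sum_{s=1}^{t} \lambda^{t-s} y_s}{\sum_{s=1}^{t} \lambda^{t-s}}
  + (1-\alpha)\, \psi'(\bar\theta).
\end{aligned}
```

Under these assumptions the first-order condition is linear in the mean parameter $\psi'(\theta)$: the estimate is a convex combination of a discounted average of the data and the reference mean. Whether the paper's recursions take exactly this form has to be checked against its own definitions.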

If this is right

Exact linear recursions exist for filtering, one-step prediction and smoothing inside the canonical exponential family.
The recursions are driven only by the current observation, the previous estimate and the fixed discount factor.
The same construction supplies a consistent theory for the asymptotic behavior of these estimators under standard regularity conditions.
The procedures apply immediately to common models such as Poisson, Bernoulli and normal time series.
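
As a minimal sketch of what such a recursion can look like, the code below implements the plain exponentially weighted likelihood filter, that is, the special case with no expected-log-likelihood term. The function names `ew_filter`, `poisson_natural` and `bernoulli_natural` are hypothetical and not taken from the paper.

```python
import math


def ew_filter(ys, lam):
    """Exponentially weighted mean filter for a one-parameter canonical family.

    Maximizing sum_s lam**(t - s) * (theta * y_s - psi(theta)) over theta
    gives psi'(theta_t) = s_t / n_t, so two running sums suffice.
    """
    s, n = 0.0, 0.0
    means = []
    for y in ys:
        s = lam * s + y    # discounted sum of observations
        n = lam * n + 1.0  # discounted observation count
        means.append(s / n)
    return means


def poisson_natural(mean):
    """Natural parameter of a Poisson with the given mean (log link)."""
    return math.log(mean) if mean > 0 else -math.inf


def bernoulli_natural(mean):
    """Natural parameter (log-odds) of a Bernoulli; infinite at 0 or 1."""
    if mean <= 0.0:
        return -math.inf
    if mean >= 1.0:
        return math.inf
    return math.log(mean / (1.0 - mean))


# Illustrative data only: a short series of counts.
counts = [3, 5, 4, 6, 2]
means = ew_filter(counts, lam=0.93)
thetas = [poisson_natural(m) for m in means]
```

Each update uses only the current observation, the two running sums and the fixed discount factor, which is the structure the list above describes. The same `ew_filter` serves a Bernoulli series by swapping in `bernoulli_natural`.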

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The linear structure may allow closed-form expressions for forecast intervals without simulation.
The discount factor acts as a tuning parameter that trades responsiveness against smoothness, suggesting systematic selection rules could be derived.
The framework could be tested on multivariate exponential-family series to see whether the linear recursions survive the vector case.
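
The trade-off between responsiveness and smoothness can be made concrete with two standard summaries of geometric weights. The sketch below is not from the paper and the function names are hypothetical: it reports the steady-state effective sample size $1/(1-\lambda)$ and the lag at which an observation's weight has halved.

```python
import math


def effective_sample_size(lam):
    """Steady-state sum of the weights lam**j, j = 0, 1, 2, ..."""
    return 1.0 / (1.0 - lam)


def half_life(lam):
    """Lag j at which the weight lam**j falls to one half."""
    return math.log(0.5) / math.log(lam)


# lam = 0.93 is the discount value quoted in the figure captions.
ess = effective_sample_size(0.93)  # about 14.3 observations
hl = half_life(0.93)               # about 9.6 periods
```

A larger discount factor lengthens both quantities, giving smoother but slower-reacting estimates.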

Load-bearing premise

Maximizing the proposed discounted convex combination of the log-likelihood and expected log-likelihood directly produces the desired filter, predictor and smoother.

What would settle it

Derive the closed-form linear recursion for a Poisson or Bernoulli series, run it on simulated data with known true parameters, and check whether the filtered estimates match the exact conditional means obtained by direct integration.
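
A narrower check than the one described above can be sketched in a few lines: it confirms that a two-sum recursion reproduces the direct maximizer of a discounted Poisson log-likelihood on simulated counts. It does not compare against conditional means obtained by integration, and it omits the expected-log-likelihood term; all function names are hypothetical.

```python
import math
import random


def poisson_draw(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's method (fine for small mu)."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def recursive_mean(ys, lam):
    """Final filtered mean from the running-sum recursion."""
    s = n = 0.0
    for y in ys:
        s, n = lam * s + y, lam * n + 1.0
    return s / n


def direct_argmax(ys, lam, iters=50):
    """Newton's method in theta on sum_s lam**(t-s) * (theta*y_s - exp(theta))."""
    t = len(ys)
    w = [lam ** (t - 1 - i) for i in range(t)]
    sw = sum(w)
    swy = sum(wi * yi for wi, yi in zip(w, ys))
    theta = 0.0
    for _ in range(iters):
        grad = swy - sw * math.exp(theta)  # score
        hess = -sw * math.exp(theta)       # second derivative (negative)
        theta -= grad / hess
    return math.exp(theta)                 # back to the mean scale


rng = random.Random(0)
ys = [poisson_draw(rng, 4.0) for _ in range(200)]
gap = abs(recursive_mean(ys, 0.93) - direct_argmax(ys, 0.93))
```

A `gap` at the level of floating-point error indicates that the recursion and the optimizer agree on this sample; it says nothing about how close either is to the true mean.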

Figures

Figures reproduced from arXiv: 2512.16745 by Neil Shephard, Simon Donker van Heel.

**Figure 1.** Simulation from $Y_t \mid Y_{1:t-1} \sim \mathrm{CEF}(\tilde\theta_{t|t-1}, h, \psi)$ for time $t = 5, \ldots, T = 2000$ with discount parameter $\lambda = 0.93$. Top: anchoring parameter $\alpha = 0.7$; bottom: $\alpha = 0.95$. Columns: evolution of $E[Y_t \mid Y_{1:t-1}] = \psi'(\tilde\theta_{t|t-1})$ in blue with observations $Y_t$ as gray circles. Panels: Exponential, Gaussian (zero mean), Pareto.

**Figure 2.** Simulation from $Y_t \mid Y_{1:t-1} \sim \mathrm{CEF}(\tilde\theta_{t|t-1}, h, \psi)$ for time $t = 5, \ldots, T = 2000$ with discount parameter $\lambda = 0.93$. Top: anchoring parameter $\alpha = 0.7$; bottom: $\alpha = 0.95$. Gray circles show observations $Y_t$. Blue lines show the conditional expectation: $E[Y_t \mid Y_{1:t-1}] = \psi'(\tilde\theta_{t|t-1})$ for the exponential, and $E[Y_t \mid Y_{1:t-1}] = \tilde\theta_{t|t-1} / (\tilde\theta_{t|t-1} + 1)$ for the Pareto (y-axis on log scale) when $\tilde\theta_{t|t-1} < -1$ and $\infty$ otherwi…

**Figure 3.** Simulation from $Y_t \mid Y_{1:t-1} \sim \mathrm{CEF}(\tilde\theta_{t|t-1}, h, \psi)$ for time $t = 5, \ldots, T = 2000$ with discount parameter $\lambda = 0.93$. Top: anchoring parameter $\alpha = 0.7$; bottom: $\alpha = 0.95$. Columns: evolution of the conditional expectation $E[Y_t \mid Y_{1:t-1}]$ (blue line) with observations $Y_t$ (gray circles) for the Beta, Gaussian (time-varying mean and variance), and von Mises distributions. For the Gaussian distribution, the red line show…

**Figure 4.** The product $K_t n_{\lambda,t}$ against $t$ for $q \in \{0.001, 0.1, 0.3, 1, 2, 10\}$. This is the ratio of the Kalman filter's weight on $Y_t$ (under diffuse initialization) to the EWMA weight on $Y_t$, showing the impact of initial conditions. Values below 1 mean the Kalman filter places less weight on the current observation than the EWMA. All curves converge to 1 in steady state. Deviations from 1 are largest, and convergence t…

**Figure 5.** Household financial situation: Uni. Michigan Survey of Consumers (Jan 1978 to…

**Figure 6.** Monte Carlo: quasi-likelihood estimator precision for household financial expec…
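
The quantity plotted in Figure 4 can be approximated with a short sketch under assumptions that the truncated caption does not confirm: a local level model whose signal-to-noise ratio is $q$, a diffuse prior so that the first Kalman gain is 1, an EWMA count $n_{\lambda,t} = \sum_{j=0}^{t-1} \lambda^j$, and a discount factor matched to the steady-state gain through $\lambda = 1 - K$. All function names are hypothetical.

```python
import math


def kalman_gains(q, T):
    """Gains K_1..K_T for a local level model, variances in noise units."""
    gains, k = [], 1.0    # diffuse prior: full weight on the first observation
    for _ in range(T):
        gains.append(k)
        p = k + q         # predicted variance: filtered variance equals K_t
        k = p / (p + 1.0)
    return gains


def steady_state_gain(q):
    """Positive root of K**2 + q*K - q = 0."""
    return (-q + math.sqrt(q * q + 4.0 * q)) / 2.0


def ewma_counts(lam, T):
    """n_t = 1 + lam + ... + lam**(t-1); the EWMA weight on y_t is 1 / n_t."""
    counts, n = [], 0.0
    for _ in range(T):
        n = lam * n + 1.0
        counts.append(n)
    return counts


def weight_ratio(q, T):
    """Kalman weight on y_t divided by EWMA weight on y_t, for t = 1..T."""
    lam = 1.0 - steady_state_gain(q)
    return [k * n for k, n in zip(kalman_gains(q, T), ewma_counts(lam, T))]


ratio = weight_ratio(1.0, 200)
```

With $q = 1$ the ratio starts at 1, dips below 1 at the second observation and returns to 1 as both filters reach steady state, consistent with the behaviour the caption describes.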

Original abstract

We propose using a discounted version of a convex combination of the log-likelihood with the corresponding expected log-likelihood such that when they are maximized they yield a filter, predictor and smoother for time series. This paper then focuses on working out the implications of this in the case of the canonical exponential family. The results are simple exact filters, predictors and smoothers with linear recursions. A theory for these models is developed and the models are illustrated on simulated and real data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives exact linear recursions for filters, predictors, and smoothers in the canonical exponential family by maximizing a discounted convex combination of log-likelihood and expected log-likelihood.

The main takeaway is that this construction produces simple exact linear updates for the natural parameters across standard exponential family time series models. The discounting introduces a forgetting factor while keeping the recursions linear, which is the practical payoff for filtering, prediction, and smoothing. The authors work out the details for the canonical case and back them with simulated and real data examples. That part is useful and cleanly executed. The theory section appears to show how the argmax of their objective stays affine in the sufficient statistics, which is what delivers the linearity. Credit to them for making the recursions explicit rather than leaving them as implicit optimization problems.

The soft spot is exactly where the stress test points: the expected log-likelihood term has to be set up so its contribution remains linear after discounting, and any dependence on the base measure or the specific discount schedule could break that. The abstract is light on the algebra, so the full derivations need to be checked for hidden restrictions. If those steps hold without extra assumptions, the result is solid; if they only work for certain families or discount rates, the scope narrows.

This is for statisticians and econometricians who already use exponential family models for counts, binary, or positive data and want exact rather than approximate recursive estimators. A reader working on state-space or online estimation would find the linear form directly usable. It deserves a serious referee because the claim is specific, the payoff is clear, and the derivations are checkable in finite time.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a discounted convex combination of the log-likelihood and the corresponding expected log-likelihood, maximized to produce filters, predictors, and smoothers for time series. For members of the canonical exponential family, the resulting estimators admit exact linear recursions in the natural parameters. A supporting theory is developed and the methods are demonstrated on simulated and real data examples.

Significance. If the derivations hold, the work supplies a clean, exact recursive framework for exponential-family time series that incorporates forgetting via discounting while preserving linearity. This is a useful addition to the toolkit for sequential estimation, as it extends conjugacy-like behavior to non-stationary settings without requiring particle methods or approximations. The linear-recursion property is particularly valuable for implementation and theoretical analysis.

major comments (2)

[§3.2] §3.2, Eq. (8)–(11): the central claim that the argmax of the discounted objective satisfies a linear recursion in the natural parameter is asserted but the derivation does not explicitly verify that the gradient of the expected-log-likelihood term remains affine in the sufficient statistic after the discount factor is introduced; a concrete expansion for at least one non-Gaussian member (e.g., Poisson) is needed to confirm the cancellation.
[Theorem 2] Theorem 2 (smoother recursion): the backward recursion is presented as exact, yet the proof sketch relies on the same discounted conjugacy that is under scrutiny in the filter step; if the forward filter already contains an approximation, the smoother cannot be guaranteed exact without additional error bounds.

minor comments (2)

[§2] Notation for the discount factor λ_t is introduced without a clear statement of whether it is time-varying or constant; consistency across filter/predictor/smoother sections would improve readability.
[Figure 2] Figure 2 caption does not specify the sample size or the exact exponential-family member used in the simulation; adding these details would aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The suggestions will help strengthen the presentation of the derivations. We address each major comment below.

Referee: [§3.2] §3.2, Eq. (8)–(11): the central claim that the argmax of the discounted objective satisfies a linear recursion in the natural parameter is asserted but the derivation does not explicitly verify that the gradient of the expected-log-likelihood term remains affine in the sufficient statistic after the discount factor is introduced; a concrete expansion for at least one non-Gaussian member (e.g., Poisson) is needed to confirm the cancellation.

Authors: We agree that an explicit verification for a non-Gaussian member would improve clarity. The general argument in Section 3.2 relies on the fact that the gradient of the expected log-likelihood term is the difference between the observed and expected sufficient statistics, which remains affine in the natural parameter after discounting because the discount factor multiplies the entire term uniformly. In the revision we will add a concrete expansion for the Poisson case, showing the explicit cancellation in the score equation that yields the linear recursion for the natural parameter.

Revision: yes

Referee: [Theorem 2] Theorem 2 (smoother recursion): the backward recursion is presented as exact, yet the proof sketch relies on the same discounted conjugacy that is under scrutiny in the filter step; if the forward filter already contains an approximation, the smoother cannot be guaranteed exact without additional error bounds.

Authors: The forward filter is obtained exactly by maximizing the discounted objective; no approximation is introduced because the canonical exponential-family structure preserves the required conjugacy under uniform discounting. The smoother recursion is then derived exactly from the forward quantities via the same conjugacy. We will expand the proof of Theorem 2 to include an explicit inductive argument confirming that exactness propagates backward without error accumulation.

Revision: yes
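
For orientation on the Poisson point, here is a sketch of the score equation for a plain discounted Poisson log-likelihood. It leaves out the expected-log-likelihood term and is not the paper's Eq. (8)–(11); the symbols $S_t$, $n_t$ and $\hat\mu_t$ are notation introduced here.

```latex
\begin{aligned}
\ell_t(\theta) &= \sum_{s=1}^{t} \lambda^{t-s}\,\big(\theta y_s - e^{\theta} - \log y_s!\big), \\
\ell_t'(\theta) &= S_t - n_t\, e^{\theta} = 0,
  \qquad S_t = \lambda S_{t-1} + y_t, \quad n_t = \lambda n_{t-1} + 1, \\
\hat\mu_t &= e^{\hat\theta_t} = \frac{S_t}{n_t}
  = \hat\mu_{t-1} + \frac{1}{n_t}\,\big(y_t - \hat\mu_{t-1}\big).
\end{aligned}
```

In this reduced setting the update is exactly linear in the mean $\hat\mu_t$, with $\hat\theta_t = \log \hat\mu_t$ recovered afterwards; the last equality uses $\lambda n_{t-1} = n_t - 1$.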

Circularity Check

0 steps flagged

No significant circularity; the linear recursions are derived directly from the proposed discounted objective.

The paper introduces a discounted convex combination of the log-likelihood and expected log-likelihood as a new objective, then shows that its maximizer for canonical exponential family members yields exact linear recursions for filtering, prediction and smoothing. This is a constructive derivation from the stated objective rather than a re-expression of pre-fitted quantities or a self-citation chain. No load-bearing step reduces by construction to its own inputs; the linearity follows from the exponential-family structure under the proposed discounting. The abstract and theory section present this as an implication to be worked out, not as an assumption smuggled in via prior work by the same authors. External benchmarks (simulated and real data) are used for illustration rather than for fitting the core recursions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven assertion that maximization of the discounted convex combination produces the stated filter/predictor/smoother; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Maximizing the discounted convex combination of log-likelihood and expected log-likelihood yields a filter, predictor, and smoother.
This is the core proposal stated in the abstract; its validity is required for all subsequent claims.


