arxiv: 2606.08692 · v1 · pith:ZJNWXTMVnew · submitted 2026-06-07 · 📊 stat.AP

Logistic Credibility with Temporal Decay: Extending Bühlmann–Straub for Commercial Lines

Pith reviewed 2026-06-27 17:37 UTC · model grok-4.3

classification 📊 stat.AP
keywords credibility theory · Bühlmann–Straub · logistic regression · EWMA decay · commercial auto · temporal weighting · calibration · prediction error

The pith

Modeling credibility weights as a logistic function of account characteristics, with a data-driven EWMA decay, restores calibration and cuts exposure-weighted prediction error by 38 percent relative to standard Bühlmann–Straub.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the classic Bühlmann–Straub formula, which uses one fixed portfolio-wide K and weights all past years equally, produces badly miscalibrated predictions for small accounts on commercial auto data. It replaces the fixed weight with a logistic function of observable account features and replaces equal weighting with an EWMA decay rate that is allowed to differ by size band. The logistic coefficients, the decay parameters, and the complement rate are estimated together in a single maximum-likelihood step. The model formally nests Bühlmann–Straub as the special case in which the logistic reduces to E_i/(E_i + K) and past years are weighted equally. On a two-year held-out test set the new model brings the calibration slope to 1.00 and lowers exposure-weighted error by 38 percent.

Core claim

The credibility weight Z_i is expressed as a logistic function of account characteristics, and historical experience is discounted by an estimated EWMA decay rate λ that varies by size. All parameters, including the complement, are estimated together by maximum likelihood, which allows a formal test against Bühlmann–Straub.
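
The nesting can be checked numerically. The sketch below assumes the two-parameter form Z = 1/(1 + exp(-(a + b ln E))) suggested by the Figure 8 caption, where a = -ln K and b = 1 reproduce the Bühlmann–Straub weight. The function names and the value K = 5000 are illustrative and are not taken from the paper.

```python
import math

def bs_weight(exposure, k):
    """Standard Buhlmann-Straub credibility weight Z = E / (E + K)."""
    return exposure / (exposure + k)

def logistic_weight(exposure, a, b):
    """Logistic credibility weight in log exposure (assumed two-parameter form)."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log(exposure))))

K = 5000.0  # illustrative value only, not an estimate from the paper
exposures = [100.0, 1_000.0, 10_000.0, 100_000.0]

for e in exposures:
    z_bs = bs_weight(e, K)
    # With a = -ln K and b = 1 the logistic collapses to E / (E + K).
    z_nested = logistic_weight(e, a=-math.log(K), b=1.0)
    print(f"E={e:>9.0f}  B-S={z_bs:.6f}  logistic(a=-ln K, b=1)={z_nested:.6f}")
```

The two columns agree because exp(-(-ln K + ln E)) = K/E, so the logistic becomes 1/(1 + K/E) = E/(E + K). Freeing a or b then moves the model away from this restricted case.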

What carries the argument

A logistic function for the credibility weights, combined with size-specific EWMA temporal decay, inside a single likelihood optimization that nests Bühlmann–Straub.
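
To make the temporal decay concrete, here is a minimal sketch of geometric (EWMA-style) year weights. It assumes the convention that the raw weight at lag j is λ^j, so that λ = 1 gives equal weighting and smaller λ concentrates weight on recent years; the paper's exact parameterisation is not given on this page and may differ. Exposure weighting is left out for brevity, and the history values are made up.

```python
def ewma_weights(n_years, lam):
    """Normalised weights for lags 0..n_years-1 (lag 0 = most recent year).

    Assumed convention: raw weight at lag j is lam**j, so lam = 1 gives
    equal weighting and smaller lam puts more weight on recent years.
    """
    raw = [lam ** lag for lag in range(n_years)]
    total = sum(raw)
    return [w / total for w in raw]

def ewma_mean(values_recent_first, lam):
    """Decay-weighted mean of a history listed most recent year first."""
    weights = ewma_weights(len(values_recent_first), lam)
    return sum(w * v for w, v in zip(weights, values_recent_first))

history = [1.20, 0.90, 1.05, 0.80, 1.10]  # made-up relative loss ratios

for lam in (1.0, 0.84, 0.6, 0.13):
    weights = [round(w, 3) for w in ewma_weights(len(history), lam)]
    print(f"lambda={lam}: weights={weights}, mean={ewma_mean(history, lam):.3f}")
```

Under this convention a value near 0.13 leaves almost all of the weight on the latest year, while a value near 0.84 spreads it across the whole five-year history.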

If this is right

  • The framework admits a likelihood-ratio test of any proposed extension against standard Bühlmann–Straub.
  • Estimated decay rates show a clear size gradient that replicates on a second line of business.
  • Only account-year summaries are required; no individual claim records are needed.
  • The procedure returns three transparent quantities for each account: credibility weight, complement, and recommended renewal rate.
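
The likelihood-ratio test in the first bullet can be sketched as follows. This is a generic illustration, not the paper's code: the log-likelihood values and the count of extra parameters are invented, and the chi-square reference distribution is the usual large-sample approximation, which may need adjusting if a restricted parameter sits on the boundary of its range.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum of half-integer powers.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total

def likelihood_ratio_test(loglik_restricted, loglik_full, extra_params):
    """Return the LR statistic and its asymptotic chi-square p-value."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical maximised log-likelihoods: B-S (restricted) vs extended model.
stat, p_value = likelihood_ratio_test(-512.4, -498.1, extra_params=4)
print(f"LR statistic = {stat:.2f}, p-value = {p_value:.2e}")
```

A small p-value would indicate that the extra logistic and decay parameters improve the fit beyond what the restricted Bühlmann–Straub model can achieve on the training data.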

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-estimation structure could be used to test other link functions or decay specifications on the same data.
  • Size-dependent temporal weighting may be worth exploring in credibility models outside insurance.
  • The single-likelihood nesting makes it straightforward to compare the logistic-EWMA version against other parametric extensions of Bühlmann–Straub.

Load-bearing premise

The logistic function of account characteristics together with size-specific EWMA decay rates estimated in a single likelihood pass are assumed to capture the relevant heterogeneity without introducing bias or overfitting on the training accounts.

What would settle it

A new held-out dataset from the same line of business on which the calibration slope remains far from 1.00 or the exposure-weighted prediction error shows no material reduction would falsify the reported improvement.
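
Both diagnostics named here are simple to compute on any held-out set. The sketch below follows the description in the Figure 2 caption (weighted OLS of actual on predicted, intercept included) and uses exposure as the weight; the data are made up, and the function names are illustrative.

```python
def weighted_calibration_slope(actual, predicted, weights):
    """Slope from weighted least squares of actual on predicted, with intercept."""
    w_sum = sum(weights)
    mean_p = sum(w * p for w, p in zip(weights, predicted)) / w_sum
    mean_a = sum(w * a for w, a in zip(weights, actual)) / w_sum
    cov = sum(w * (p - mean_p) * (a - mean_a)
              for w, p, a in zip(weights, predicted, actual))
    var = sum(w * (p - mean_p) ** 2 for w, p in zip(weights, predicted))
    return cov / var

def weighted_mse(actual, predicted, weights):
    """Exposure-weighted mean squared prediction error."""
    num = sum(w * (a - p) ** 2 for w, a, p in zip(weights, actual, predicted))
    return num / sum(weights)

# Made-up held-out data: predictions are shrunk too hard toward 1.0.
actual = [0.5, 0.8, 1.0, 1.3, 1.6]
shrunk = [1.0 + 0.2 * (a - 1.0) for a in actual]
exposure = [1.0, 2.0, 3.0, 2.0, 1.0]

slope = weighted_calibration_slope(actual, shrunk, exposure)
print(f"calibration slope = {slope:.2f}")  # close to 5: predictions vary too little
print(f"wMSE = {weighted_mse(actual, shrunk, exposure):.4f}")
```

A slope well above 1 means the predictions move less than the outcomes do, which is the pattern the abstract attributes to under-crediting of small accounts.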

Figures

Figures reproduced from arXiv: 2606.08692 by Jake Morris.

Figure 1: Loss ratio distributions for small (red), mid (orange), and large (green) companies across all …
Figure 2: Calibration slope (NEP-weighted OLS of actual on predicted, intercept included; ideal = 1.00) for …
Figure 3: Distribution of company-level mean relative loss ratio (own multi-year mean …
Figure 4: Year-on-year rank correlation of absolute (grey) and relative (blue) loss ratios by lag, split by …
Figure 5: Account-level credibility weight Z_i, sorted by net earned premium. Coloured points: logistic model (scalar λ, shown for illustration) with 95% posterior CI. Grey crosses: Bühlmann–Straub point estimate (no uncertainty). Small accounts (red, left): wide intervals, reflecting genuine uncertainty about how much experience to trust. Large accounts (green, right): narrow intervals; departures from the model estimate ca…
Figure 6: Prediction error (wMSE, % change vs standard B-S) at each sequential patch step, by size tercile.
Figure 7: Illustrative credibility weight curves by industry segment. Each coloured curve shows …
Figure 8: Left: Bühlmann–Straub (grey dashed) and logistic with a = −ln K, b = 1 (blue solid) are identical at every exposure level (the nesting identity). Centre: Freeing b controls gradient steepness; b > 1 gives large accounts more credibility faster, b < 1 flattens the curve. Right: Freeing a shifts the midpoint (effective K); lower effective K means credibility is earned faster across all sizes. Both a and b a…
Figure 9: Relative weight assigned to each historical year as a function of the decay parameter …
Figure 10: Pre-adoption signal check: EWMA f̄ versus actual next-year relative loss ratio by size tercile (training data, AY 2001–2005). NEP-weighted OLS slope and R² annotated per panel. Signal is evident across all company size bands in the training data. The signal decays with time, at a rate that differs by account size.
Figure 11: Spearman rank correlation of relative loss ratio between every pair of accident years (CAS …
Figure 12: Spearman rank correlation matrices by account size tercile (CAS commercial auto, 96 qualifying …
Figure 13: Each point is one company: mean relative loss ratio (own average over AY 2001–2005, divided …
Figure 14: Actual vs expected loss ratio by prediction decile (NEP-weighted), held-out test set (AY 2006–2007).
Figure 15: Calibration slope by size tercile, held-out test set (AY 2006–2007). Ideal slope = 1.00. Bühlmann–…
Figure 16: Exposure-weighted MSE as a function of fixed scalar decay …
Figure 17: Estimated λ as a function of insurer size (log mean training NEP Ē_i). Dark blue band: continuous λ model posterior mean and 95% CI; coloured triangles with 95% CIs: free tercile estimates (red = Small, orange = Mid, green = Large). All three tercile CIs overlap the continuous band, consistent with a smooth gradient, but LOO-CV and held-out wMSE both favour the tercile specification (Section 3.3.2).
Figure 18: Posterior distributions of λ by size tercile. Large is narrow and well-identified; Small and Mid are wide and largely overlapping (CI values in text below). Economic interpretation. A small company recording modest annual premium generates a claims record with high year-to-year noise relative to the underlying signal. To reliably separate signal from noise, the model must average across multiple years, a…
Figure 19: Z-shape validation: proposed model (tercile- …
Figure 20: Pre-adoption signal check for Other Liability: EWMA …
Figure 21: Lag-1 predictability of relative loss ratio (training AY 2001–2005 vs test AY 2006–2007) by size …
Figure 22: Deviance relative to Bühlmann–Straub across 50 seeds, five representative models (full model …
Figure 23: Effective lookback window as a function of EWMA decay rate …
Figure 24: Gamma shape parameter vs log-NEP. Fitted slope …
Figure 25: Held-out loss ratios (points, coloured by size tercile) with loess-smoothed 95% posterior predictive …
Figure 26: Empirical Tier 3 (parameter uncertainty + Gamma process noise) coverage by company size decile …
Figure 27: Loss ratio trajectories for a random sample of 12 companies per size tercile (same seed), before …
Figure 28: Prediction error (log-wMSE, % change vs standard B-S) at each sequential patch step, by size …
Figure 29 compares the implied Z from all three frameworks. The logistic is nearly flat (Z ≈ 0.60–0.67); the GLMM-implied Z rises steeply (≈ 0.14 → 0.33 → 0.37 across terciles); standard B-S is steeper still, reaching Z > 0.8 for the largest accounts (the pooled-K constraint discussed in Section 3.7).
Figure 30: Fitted logistic Z curves for the balanced panel (96 companies, solid) and the expanded panel (all companies with ≥ 2 training years, dashed), plotted against log lookback exposure log Ẽ_i. Vertical lines mark the tercile breaks. The two curves are nearly identical in the mid-to-large range; the main difference is at small exposures, where the expanded panel includes less predictable entrants and exiters…
Original abstract

Bühlmann–Straub (B-S) credibility assigns each account a weight $Z_i = E_i/(E_i+K)$, where $K$ is a single portfolio-wide ratio. The formula assumes $K$ is the same for every account regardless of size, history length, or volatility, and that recent and older years carry equal weight. On a held-out US commercial auto dataset these assumptions fail: standard B-S applied to 96 companies produces a calibration slope of 29 for small accounts, a signature of severe under-crediting. We propose a joint framework that retains B-S interpretability while addressing these limitations. The credibility weight $Z_i$ is modelled as a logistic function of account characteristics; historical experience is discounted by an EWMA decay parameter $\lambda$ estimated from the data; and $Z$, $\lambda$, and the complement are optimised in a single likelihood pass. The framework formally nests Bühlmann–Straub as a special case, admitting a likelihood-ratio test for any proposed extension. On a two-year held-out test set the proposed model restores calibration (slope = 1.00) and reduces exposure-weighted prediction error by 38% (90% bootstrap interval: 26%–50%). A size gradient in the decay rate emerges ($\hat\lambda \approx 0.6$, $0.84$, $0.13$ for Small, Mid, Large) and replicates qualitatively on Other Liability. A simulation study confirms the mechanisms. The model requires only account-year summaries and delivers three transparent outputs: credibility weight, complement, and recommended renewal rate.
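
As a baseline illustration of the formula in the abstract, the sketch below computes the standard weight Z_i = E_i/(E_i + K) and the usual credibility blend Z × own experience + (1 − Z) × complement. The value of K, the exposures, and the loss ratios are made up, and the function names are illustrative.

```python
def bs_weight(exposure, k):
    """Standard Buhlmann-Straub credibility weight Z = E / (E + K)."""
    return exposure / (exposure + k)

def credibility_estimate(own_experience, complement, z):
    """Blend of an account's own experience and the complement."""
    return z * own_experience + (1.0 - z) * complement

K = 2000.0             # illustrative portfolio-wide constant
complement = 1.00      # illustrative complement (relative loss ratio)
own_experience = 1.40  # illustrative account experience

for exposure in (500.0, 2000.0, 18000.0):
    z = bs_weight(exposure, K)
    estimate = credibility_estimate(own_experience, complement, z)
    print(f"E={exposure:>7.0f}  Z={z:.2f}  estimate={estimate:.3f}")
```

With a single K, the weight depends on exposure alone, so two accounts of equal size receive the same Z whatever their volatility or history length. That is the constraint the paper relaxes.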

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper extends Bühlmann-Straub credibility by replacing the fixed portfolio-wide K with a logistic function of account characteristics for the credibility weight Z_i, adding size-specific EWMA temporal decay rates λ estimated from data, and optimizing Z, λ, and the complement jointly via a single likelihood. The model nests standard B-S as a special case (admitting a likelihood-ratio test) and is evaluated on a two-year held-out US commercial auto dataset, where it reports restored calibration (slope = 1.00) and a 38% reduction in exposure-weighted prediction error (90% bootstrap interval 26%–50%), together with a size gradient in the estimated decay rates.

Significance. If the held-out results hold, the work supplies a practical, interpretable refinement of credibility theory for commercial lines that directly addresses the documented failures of constant-K and equal-history weighting. Credit is due for the explicit nesting of B-S, the use of a disjoint held-out test set with bootstrap intervals on both calibration slope and error reduction, the simulation study confirming mechanisms, and the requirement of only account-year summaries while producing transparent outputs (credibility weight, complement, renewal rate).
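
A bootstrap interval of the kind credited above can be sketched as follows. This is an assumption-laden illustration: it resamples whole accounts with replacement and takes percentile limits, whereas the paper's actual resampling scheme is not described on this page. All data are synthetic.

```python
import random

def wmse(rows, column):
    """Exposure-weighted MSE; each row is (exposure, actual, pred_base, pred_new)."""
    num = sum(r[0] * (r[1] - r[column]) ** 2 for r in rows)
    return num / sum(r[0] for r in rows)

def pct_reduction(rows):
    """Percent reduction in wMSE of the new model (column 3) vs the baseline (column 2)."""
    return 100.0 * (1.0 - wmse(rows, 3) / wmse(rows, 2))

def bootstrap_interval(rows, n_boot=2000, level=0.90, seed=0):
    """Percentile bootstrap interval, resampling accounts with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        pct_reduction([rng.choice(rows) for _ in rows]) for _ in range(n_boot)
    )
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Synthetic accounts: the baseline predicts 1.0 for everyone, while the
# hypothetical new model recovers each account's underlying level.
gen = random.Random(1)
rows = []
for _ in range(40):
    exposure = gen.uniform(1.0, 10.0)
    level_i = gen.uniform(0.6, 1.4)
    actual = level_i + gen.gauss(0.0, 0.1)
    rows.append((exposure, actual, 1.0, level_i))

lo, hi = bootstrap_interval(rows)
print(f"point estimate = {pct_reduction(rows):.1f}%, 90% interval = ({lo:.1f}%, {hi:.1f}%)")
```

Resampling at the account level keeps each account's exposure, outcome, and both predictions together, so the interval reflects variation across accounts rather than within them.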

major comments (2)
  1. [Abstract] The joint single-pass MLE of logistic coefficients together with three size-specific λ values (Small/Mid/Large) is presented without regularization, effective degrees of freedom, or the number of account characteristics; this makes the headline held-out claims (slope = 1.00 and 38% error reduction) vulnerable to optimistic bias from finite-sample noise in the training accounts.
  2. [Abstract] The exact likelihood formulation, data exclusion rules, and definition of the size bands are not supplied, so the reported calibration and error metrics on the held-out set cannot be fully reconstructed or stress-tested for misspecification.
minor comments (1)
  1. The reported point estimates λ̂ ≈ 0.6, 0.84, 0.13 lack standard errors or intervals, which would strengthen the claim of a replicable size gradient.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Both major comments correctly identify information that is absent from the abstract. We will revise the abstract (and ensure the main text is explicit) to supply the missing details. Point-by-point responses follow.

  1. Referee: [Abstract] The joint single-pass MLE of logistic coefficients together with three size-specific λ values (Small/Mid/Large) is presented without regularization, effective degrees of freedom, or the number of account characteristics; this makes the headline held-out claims (slope = 1.00 and 38% error reduction) vulnerable to optimistic bias from finite-sample noise in the training accounts.

    Authors: We agree the abstract should state the number of account characteristics entering the logistic model. The manuscript uses unregularized maximum likelihood; because Bühlmann–Straub is nested explicitly as a special case, a likelihood-ratio test is available to guard against gratuitous complexity. The primary safeguard against optimistic bias is the disjoint two-year held-out test set together with bootstrap intervals on both calibration slope and error reduction. We will add the number of characteristics and a brief note on the absence of regularization to the revised abstract. Revision: yes

  2. Referee: [Abstract] The exact likelihood formulation, data exclusion rules, and definition of the size bands are not supplied, so the reported calibration and error metrics on the held-out set cannot be fully reconstructed or stress-tested for misspecification.

    Authors: The referee is correct that these elements are not stated in the abstract. The full manuscript defines the likelihood (the standard Bühlmann–Straub form with logistic Z and EWMA decay), the account-year data filters, and the exposure-based size bands (Small/Mid/Large). We will insert concise references to these definitions in the revised abstract so that the held-out metrics can be reconstructed from the text alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; held-out metrics independent of fitting equations

The paper defines Z_i as a logistic function of account characteristics and introduces size-specific EWMA decay rates λ, estimated jointly with the logistic coefficients and the complement via a single likelihood on the training accounts. The headline results (slope = 1.00, 38% error reduction) are reported on a disjoint two-year held-out test set whose observations do not enter the likelihood equations. The nesting of B-S as a special case is a formal restriction (the logistic reduces to E_i/(E_i + K) and past years are weighted equally) that permits a likelihood-ratio test but does not make the test-set metrics tautological. No self-citation chain, uniqueness theorem, or ansatz imported from prior work is invoked to justify the central performance claims. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on three fitted components (logistic coefficients, size-specific decay rates, and the complement) plus the assumption that maximum-likelihood estimation on account-year summaries yields stable, generalizable parameters. No new physical entities are postulated.

free parameters (2)
  • logistic coefficients for Z
    Parameters of the logistic function that maps account characteristics to credibility weight; fitted jointly by likelihood.
  • EWMA decay rates lambda
    Three size-specific values (approx. 0.6, 0.84, 0.13) estimated from data rather than fixed at 1.
axioms (2)
  • domain assumption Bühlmann–Straub is recovered exactly when the logistic reduces to E_i/(E_i + K) (a = −ln K, b = 1 in the Figure 8 parameterisation) and lambda equals 1
    Stated in the description of the joint framework that formally nests the original model.
  • domain assumption Account-year summaries contain sufficient information for stable likelihood estimation
    Implicit in the claim that the model requires only those summaries.

pith-pipeline@v0.9.1-grok · 5820 in / 1682 out tokens · 37951 ms · 2026-06-27T17:37:37.948615+00:00 · methodology
