pith. sign in

arxiv: 2606.24715 · v2 · pith:H7HQIPENnew · submitted 2026-06-23 · 📊 stat.ML · cs.LG

Model selection with proper scoring rules on data sets of time series: prefer the mean scaled score

Pith reviewed 2026-06-25 21:42 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords model selectionproper scoring rulesprobabilistic forecastingtime seriesaggregation methodsscore skewnessintermittent demand
0
0 comments X

The pith

Skewness in score distributions on short test sets causes non-mean aggregation methods to select misspecified probabilistic forecasting models while the mean scaled score stays reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When selecting among probabilistic forecasting models evaluated on many time series, different ways of combining the scores across series can point to different winners. The discrepancies trace to the skewed shape of the score distributions, which grows stronger when each test set is short. Rank-based summaries, medians, and win rates then favor models that are actually misspecified. Averaging the scores after scaling them avoids the distortion entirely. When test sets become longer the various aggregation rules begin to agree on the same model.

Core claim

The skewness of the distribution of scores, which is particularly pronounced when test sets are short, can cause non-mean criteria such as mean rank, median, and win rate to select misspecified models. In contrast, the mean score is immune from this problem. As the size of the test sets increases, all aggregation criteria converge to the same model selection decision. On intermittent demand data including the M5 competition, the mean scaled score produces stable choices across different scaling factors.

What carries the argument

The mean scaled score, formed by averaging properly scaled proper scoring rule values across time series, which remains unaffected by skewness that distorts rank, median, and win-rate aggregations.

If this is right

  • Mean scaled score selects the correctly specified model even when test sets are short.
  • Rank, median, and win-rate methods can select misspecified models under the same short-test conditions.
  • All aggregation methods reach the same decision once test sets are sufficiently long.
  • The mean scaled score decision stays consistent when different scaling constants are used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same skewness mechanism could affect model selection in other domains that rely on proper scoring rules over short evaluation windows, such as online learning or sequential decision tasks.
  • Practitioners could test robustness by artificially lengthening or shortening test sets on their own data to check whether aggregation choice changes.
  • If score distributions remain skewed even with moderate test lengths, hybrid rules that first symmetrize the scores before ranking might be worth exploring.

Load-bearing premise

That the disagreements between aggregation methods are produced specifically by skewness in the score distributions rather than by other features of the data or the scoring rules.

What would settle it

A controlled simulation in which score distributions are forced to be symmetric yet non-mean aggregation methods still select different models than the mean, or in which skewness is preserved but the mean and other methods continue to disagree even on large test sets.

Figures

Figures reproduced from arXiv: 2606.24715 by Dario Azzimonti, Giorgio Corani, Lorenzo Zambon, Stefano Damato.

Figure 1
Figure 1. Figure 1: Distribution of the average QS over h = 4 (top) and h = 28 (bottom) steps ahead. The true distribution N (0, 1) is compared against a misspecified distribution N (0, 0.852 ). While the mean of the QS (solid line) is consistently lower for the true distribution, the median (dashed line) is higher when the test set is short. This corresponds to a mean rank exceeding 1.5, leading to wrong model selection. mod… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of the Poisson and the negative binomial distribution on on the M5 data set, considering different scoring [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
read the original abstract

We study the problem of model selection among probabilistic forecasting models evaluated on datasets of multiple time series. The performance of a model on a single time series is quantified by the average value (score) of a proper scoring rule over a test set, but extending model selection to data sets of time series requires aggregating these scores. Common approaches either rely on scaling scores and averaging them (mean scaled score) or avoid scaling by using alternative statistics such as mean ranks or win rates. However, these approaches can yield conflicting conclusions. We show that such discrepancies arise from the skewness of the distribution of the scores, which is particularly pronounced when test sets are short. The skewness can cause non-mean criteria (e.g., mean rank, median, win rate) to select misspecified models. In contrast, the mean score is immune from this problem. We further show that, as the size of the test sets increases, all aggregation criteria converge to the same model selection decision, mitigating these discrepancies. Our experiments on intermittent demand time series, including data from the M5 competition, highlight the importance of sufficiently large test sets; the mean scaled score appears to be the more reliable approach, also because empirically we found its decision to remain consistent when different scaling factors are adopted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies model selection among probabilistic forecasting models on datasets of multiple time series. Performance on each series is measured by the average of a proper scoring rule over the test set; aggregation across series is required for overall selection. The authors argue that discrepancies between common aggregation methods (mean scaled score vs. mean rank, median, or win rate) arise specifically from skewness in the per-series score distributions, which is pronounced for short test sets and can cause non-mean methods to select misspecified models. The mean scaled score is claimed to be immune. They further show that all methods converge as test-set length grows and present experiments on intermittent-demand series, including M5 competition data, concluding that the mean scaled score is more reliable and consistent across scaling choices.

Significance. If the central mechanism holds, the work offers practical guidance for aggregating proper scores in multi-series forecasting, especially in intermittent-demand settings where short test sets are typical. It supplies empirical evidence from the M5 dataset and other real series, and notes convergence with larger test sets. The paper does not ship machine-checked proofs or parameter-free derivations, but the reproducible experimental design on public data is a strength.

major comments (2)
  1. [Experiments section] Experiments section (M5 and intermittent-demand results): the reported ranking reversals between mean scaled score and mean-rank/median/win-rate are consistent with skewness but do not isolate it; no controlled simulations with symmetric or low-variance score distributions are described, so the attribution remains correlational. Variance heterogeneity or serial dependence could produce the same reversals, weakening the claim that skewness is the specific load-bearing cause.
  2. [Theoretical mechanism section] Section on theoretical mechanism (around the claim that mean is immune): the argument that skewness alone drives non-mean criteria to favor misspecified models is not accompanied by a derivation or counter-example showing that other distributional features (e.g., heavy tails without skewness) cannot induce the same selection error.
minor comments (2)
  1. [Experiments section] The statement that the mean scaled score decision 'remains consistent when different scaling factors are adopted' would be strengthened by reporting the exact scaling factors tested and the quantitative agreement metric.
  2. [Notation] Notation for the scaled score (e.g., definition of the scaling factor) should be introduced earlier and used consistently when comparing aggregation methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and the opportunity to clarify and strengthen our arguments. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [Experiments section] Experiments section (M5 and intermittent-demand results): the reported ranking reversals between mean scaled score and mean-rank/median/win-rate are consistent with skewness but do not isolate it; no controlled simulations with symmetric or low-variance score distributions are described, so the attribution remains correlational. Variance heterogeneity or serial dependence could produce the same reversals, weakening the claim that skewness is the specific load-bearing cause.

    Authors: We acknowledge that the current experiments are observational and do not fully isolate skewness through controlled simulations. To address this, we will add a new subsection with Monte Carlo simulations of score distributions under different conditions: symmetric low-variance, symmetric high-variance, skewed, and with serial dependence. These will demonstrate that ranking reversals occur primarily under skewness, supporting our attribution. This revision will make the causal claim more robust. revision: yes

  2. Referee: [Theoretical mechanism section] Section on theoretical mechanism (around the claim that mean is immune): the argument that skewness alone drives non-mean criteria to favor misspecified models is not accompanied by a derivation or counter-example showing that other distributional features (e.g., heavy tails without skewness) cannot induce the same selection error.

    Authors: The paper's mechanism section explains the effect via the impact of skewness on the ordering induced by ranks and medians versus the mean. While a complete mathematical derivation isolating skewness from all other features (such as kurtosis) is not provided, we can include a simple analytical counter-example in the revision: consider a symmetric heavy-tailed distribution where all aggregation methods select the same model, contrasting with the skewed case where non-mean methods err. This will be added to strengthen the section. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical analysis of score distributions and test-set size effects stands independently of inputs

full rationale

The paper presents an empirical argument that skewness in per-series proper scores (worse for short test sets) causes non-mean aggregators to select misspecified models while the mean scaled score does not; this is supported by direct examination of score distributions on M5 and intermittent-demand data plus convergence results as test-set length grows. No derivation reduces a claimed prediction to a fitted parameter by construction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The central claim is therefore a statistical observation about aggregation behavior rather than a tautological re-expression of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard concepts of proper scoring rules and statistical properties of finite-sample score distributions. No free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)
  • standard math Proper scoring rules provide a consistent way to evaluate probabilistic forecasts.
    Core assumption in probabilistic forecasting literature.
  • domain assumption Score distributions on short test sets are skewed.
    Invoked to explain discrepancies in aggregation methods.

pith-pipeline@v0.9.1-grok · 5763 in / 1274 out tokens · 25485 ms · 2026-06-25T21:42:45.929647+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

300 extracted references · 9 canonical work pages

  1. [1]

    International Journal of Forecasting , author=

    The M3 competition: Statistical tests of the results , volume=. International Journal of Forecasting , author=. 2005 , month=. doi:10.1016/j.ijforecast.2004.10.003 , number=

  2. [2]

    Hyndman , year =

    Rob J. Hyndman , year =

  3. [3]

    1979 , publisher=

    Principles of statistics , author=. 1979 , publisher=

  4. [4]

    2022 , note =

    moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests , author =. 2022 , note =

  5. [5]

    2003 , publisher=

    Handbook of parametric and nonparametric statistical procedures, Fifth Edition , author=. 2003 , publisher=

  6. [6]

    International Journal of Forecasting , author =

    M5 accuracy competition:. International Journal of Forecasting , author =. 2022 , pages =

  7. [7]

    International Journal of Forecasting , author =

    The. International Journal of Forecasting , author =. 2005 , pages =

  8. [8]

    Journal of the Operational Research Society , pages=

    Incorporating risk preferences in forecast selection , author=. Journal of the Operational Research Society , pages=. 2026 , publisher=

  9. [9]

    Statistical comparisons of classifiers over multiple data sets , author=

  10. [10]

    International Journal of Production Economics , author =

    Scalable probabilistic forecasting in retail with gradient boosted trees:. International Journal of Production Economics , author =. 2025 , pages =

  11. [11]

    Journal of Computational and graphical Statistics , volume=

    A new tidy data structure to support exploration and modeling of temporal data , author=. Journal of Computational and graphical Statistics , volume=. 2020 , publisher=

  12. [12]

    Computational Statistics & Data Analysis , volume=

    A note on the validity of cross-validation for evaluating autoregressive time series prediction , author=. Computational Statistics & Data Analysis , volume=. 2018 , publisher=

  13. [13]

    European Journal of Operational Research , volume=

    Temporal hierarchies with autocorrelation for load forecasting , author=. European Journal of Operational Research , volume=. 2020 , publisher=

  14. [14]

    Management science , volume=

    Scoring rules for continuous probability distributions , author=. Management science , volume=. 1976 , publisher=

  15. [15]

    2021 , publisher=

    Forecasting: principles and practice, 3rd edition , author=. 2021 , publisher=

  16. [16]

    2024 , publisher=

    Wickramasuriya, Shanika L , journal=. 2024 , publisher=

  17. [17]

    International Journal of Forecasting , volume=

    Likelihood-based inference in temporal hierarchies , author=. International Journal of Forecasting , volume=. 2024 , publisher=

  18. [18]

    arXiv preprint arXiv:1408.4050 , year=

    Bayesian inference for a covariance matrix , author=. arXiv preprint arXiv:1408.4050 , year=

  19. [19]

    Zhang, Zhiyong , journal=. A

  20. [20]

    International Journal of Production Economics , volume=

    Stochastic coherency in forecast reconciliation , author=. International Journal of Production Economics , volume=. 2021 , publisher=

  21. [21]

    Automatic time series forecasting: the forecast package for

    Hyndman, Rob J and Khandakar, Yeasmin , journal=. Automatic time series forecasting: the forecast package for

  22. [22]

    Bayesian Analysis , number =

    Yuling Yao and Aki Vehtari and Daniel Simpson and Andrew Gelman , title =. Bayesian Analysis , number =. 2018 , doi =

  23. [23]

    Johnson , title =

    Steven G. Johnson , title =. 2008 , url =

  24. [24]

    Elgavish, Gal , journal=

  25. [25]

    Gelman, Andrew , journal=

  26. [26]

    Applied Energy , volume=

    Heat load forecasting using adaptive spatial hierarchies , author=. Applied Energy , volume=. 2023 , publisher=

  27. [27]

    European Mathematical Society Magazine , number=

    Reconciling temporal hierarchies of wind power production with forecast-dependent variance structures , author=. European Mathematical Society Magazine , number=

  28. [28]

    Wind Energy , volume=

    Reconciliation of wind power forecasts in spatial hierarchies , author=. Wind Energy , volume=. 2023 , publisher=

  29. [29]

    arXiv preprint arXiv:2402.06480 , year=

    M. arXiv preprint arXiv:2402.06480 , year=

  30. [30]

    2023 , publisher=

    Wang, Xiaoqian and Hyndman, Rob J and Li, Feng and Kang, Yanfei , journal=. 2023 , publisher=

  31. [31]

    Neural Comput , volume=

    Schmidhuber, J. Neural Comput , volume=

  32. [32]

    2024 , author =

    International Journal of Forecasting , volume =. 2024 , author =

  33. [33]

    and Kong, W

    Das, A. and Kong, W. and Paria, B. and Sen, R. , booktitle =. 2023 , volume =

  34. [34]

    2022 , author =

    International Journal of Forecasting , volume =. 2022 , author =

  35. [35]

    and Syntetos, Aris A

    Boylan, John E. and Syntetos, Aris A. , year =. Intermittent demand forecasting: context, methods and applications , language =

  36. [36]

    Journal of Applied Statistics , volume=

    Efficient accounting for estimation uncertainty in coherent forecasting of count processes , author=. Journal of Applied Statistics , volume=. 2022 , publisher=

  37. [37]

    Journal of Business & Economic Statistics , volume=

    Bayesian forecasting of many count-valued time series , author=. Journal of Business & Economic Statistics , volume=. 2020 , publisher=

  38. [38]

    European Journal of Operational Research , volume=

    Understanding forecast reconciliation , author=. European Journal of Operational Research , volume=. 2021 , publisher=

  39. [39]

    doi:10.18637/jss.v082.i05 , language=

    Journal of Statistical Software , author=. doi:10.18637/jss.v082.i05 , language=

  40. [40]

    European Journal of Operational Research , author =

    Combining probabilistic forecasts of. European Journal of Operational Research , author =. 2023 , pages =. doi:10.1016/j.ejor.2021.06.044 , language =

  41. [41]

    2019 , volume =

    Alexander Jordan and Fabian Kr\"uger and Sebastian Lerch , journal =. 2019 , volume =

  42. [42]

    Dario Azzimonti and Nicolò Rubattu and Lorenzo Zambon and Giorgio Corani , year =

  43. [43]

    Ivan Svetunkov , year =

  44. [44]

    Kowal , year =

    Brian King and Daniel R. Kowal , year =

  45. [45]

    International Journal of Forecasting , author =

    Commentary on the. International Journal of Forecasting , author =. 2022 , pages =

  46. [46]

    2026 , eprint=

    fev-bench: A Realistic Benchmark for Time Series Forecasting , author=. 2026 , eprint=

  47. [47]

    , author=

    Bayesian estimation supersedes the t test. , author=. 2013 , publisher=

  48. [48]

    International Journal of Production Research , volume=

    On the demand distributions of spare parts , author=. International Journal of Production Research , volume=. 2012 , publisher=

  49. [49]

    2022 , pages =

    International Journal of Forecasting , author =. 2022 , pages =. doi:10.1016/j.ijforecast.2021.09.008 , language =

  50. [50]

    Statistics and Computing , volume=

    Efficient probabilistic reconciliation of forecasts for real-valued and count time series , author=. Statistics and Computing , volume=. 2024 , publisher=

  51. [51]

    International Journal of Forecasting , volume =

    Properties of the reconciled distributions for. International Journal of Forecasting , volume =. 2024 , author =

  52. [52]

    European Journal of Operational Research , volume=

    Optimal reconciliation with immutable forecasts , author=. European Journal of Operational Research , volume=. 2023 , publisher=

  53. [53]

    Yang, Yangzhuoran Fin and Athanasopoulos, George and Hyndman, Rob J and Panagiotelis, Anastasios , journal=

  54. [54]

    and Tastu, J

    Pinson, P. and Tastu, J. , year=

  55. [55]

    Hollyman, Ross and Petropoulos, Fotios and Tipping, Michael E , journal=

  56. [56]

    Applied Intelligence , volume=

    Mrad, Ali Ben and Delcroix, V. Applied Intelligence , volume=. 2015 , publisher=

  57. [57]

    arXiv preprint arXiv:2210.12236 , year=

    Uncertain Evidence in Probabilistic Models and Stochastic Simulators , author=. arXiv preprint arXiv:2210.12236 , year=

  58. [58]

    1988 , publisher=

    Probabilistic reasoning in intelligent systems: networks of plausible inference , author=. 1988 , publisher=

  59. [59]

    Kolassa, Stephan , journal=

  60. [60]

    Springer handbook of robotics , pages=

    Multisensor data fusion , author=. Springer handbook of robotics , pages=. 2016 , publisher=

  61. [61]

    2003 , publisher=

    Statistical models , author=. 2003 , publisher=

  62. [62]

    2018 , note =

    ZIM: Zero-Inflated Models (ZIM) for Count Time Series with Excess Zeros , author =. 2018 , note =

  63. [63]

    arXiv:2105.06643 , year=

    Monash time series forecasting archive , author=. arXiv:2105.06643 , year=

  64. [64]

    International Journal of Forecasting , author =

    Another look at measures of forecast accuracy , volume =. International Journal of Forecasting , author =. 2006 , pages =

  65. [65]

    Journal of the American Statistical Association , pages=

    Count time series: A methodological review , author=. Journal of the American Statistical Association , pages=. 2021 , publisher=

  66. [66]

    International Journal of Forecasting , volume=

    Principles and algorithms for forecasting groups of time series: Locality and globality , author=. International Journal of Forecasting , volume=. 2021 , publisher=

  67. [67]

    Evaluating Probabilistic Forecasts with

    Alexander Jordan and Fabian Kr\"uger and Sebastian Lerch , journal =. Evaluating Probabilistic Forecasts with. 2019 , volume =

  68. [68]

    Panagiotelis, Anastasios and Gamakumara, Puwasala and Athanasopoulos, George and Hyndman, Rob J , journal=

  69. [69]

    Communications in Statistics-Simulation and Computation , volume=

    Criteria for evaluating approximations of count distributions , author=. Communications in Statistics-Simulation and Computation , volume=. 2020 , publisher=

  70. [70]

    International Journal of Forecasting , volume=

    Quantiles as optimal point forecasts , author=. International Journal of Forecasting , volume=. 2011 , publisher=

  71. [71]

    Probabilistic programming in

    Salvatier, John and Wiecki, Thomas V and Fonnesbeck, Christopher , journal=. Probabilistic programming in. 2016 , publisher=

  72. [72]

    Simulation of correlated

    Barbiero, Alessandro and Ferrari, Pier Alda , journal=. Simulation of correlated. 2015 , publisher=

  73. [73]

    2021 , issn =

    Kaggle forecasting competitions: An overlooked learning opportunity , journal =. 2021 , issn =

  74. [74]

    International Journal of Forecasting , volume=

    Evaluating predictive count data distributions in retail sales forecasting , author=. International Journal of Forecasting , volume=. 2016 , publisher=

  75. [75]

    Botev, Z I , journal =

  76. [76]

    A Novel Shrinkage Estimator of the Covariance Matrix for Hierarchical Time Series

    Carrara, Chiara and Zambon, Lorenzo and Azzimonti, Dario and Corani, Giorgio. A Novel Shrinkage Estimator of the Covariance Matrix for Hierarchical Time Series. Statistics for Innovation I. 2025

  77. [77]

    and Keyes, David E

    Cao, Jian and Genton, Marc G. and Keyes, David E. and Turkiyyah, George M. , title=. Statistics and Computing , year=

  78. [78]

    Akbudak, Kadir and Ltaief, Hatem and Mikhalev, Aleksandr and David, Keyes , journal =

  79. [79]

    Hackbusch, Wolfgang , isbn =

  80. [80]

    , eprint =

    Murray, Iain and Adams, Ryan Prescott and MacKay, David J.C. , eprint =. Journal of Machine Learning Research , number =

Showing first 80 references.