Proper scoring rules for estimation and forecast evaluation

Johanna Ziegel; Kartik Waghmare

arXiv: 2504.01781 · v4 · submitted 2025-04-02 · math.ST · stat.ML · stat.TH

Pith reviewed 2026-05-22 21:56 UTC · model grok-4.3

classification: math.ST · stat.ML · stat.TH

keywords: proper scoring rules · forecast evaluation · probabilistic estimation · characterization results · statistics · machine learning · probabilistic forecasts

The pith

Proper scoring rules are characterized by mathematical properties that enable their use for both estimating distributions and evaluating probabilistic forecasts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews the mathematical foundations of proper scoring rules, including general characterization results and important families of scoring rules. It discusses their role in statistics and machine learning for estimation and forecast evaluation. A sympathetic reader would care because these rules ensure that forecasters are incentivized to report their true beliefs, leading to more reliable assessments in predictive modeling. The review also comments on developments in applications of these rules.

Core claim

This article reviews the mathematical foundations of proper scoring rules including general characterization results and important families of scoring rules. We discuss their role in statistics and machine learning for estimation and forecast evaluation. Furthermore, we comment on interesting developments of their usage in applications.

What carries the argument

Proper scoring rules: scoring rules under which the expected score is maximized when the reported distribution equals the true distribution. A rule is strictly proper when the true distribution is the unique maximizer.
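The propriety condition can be checked numerically in the simplest setting, a binary outcome. The sketch below is an illustration added here, not code from the paper: it assumes a true event probability of 0.7, positively oriented scores (higher is better), and a grid of candidate reports. The helper names (`expected_score`, `best_report`) and the improper linear score used for contrast are hypothetical choices for this example.

```python
import math

def log_score(q, y):
    """Positively oriented logarithmic score for a binary outcome y in {0, 1}."""
    return math.log(q if y == 1 else 1.0 - q)

def quadratic_score(q, y):
    """Positively oriented quadratic (Brier-type) score: -(q - y)^2."""
    return -(q - y) ** 2

def linear_score(q, y):
    """Linear score: the probability assigned to the realized outcome (improper)."""
    return q if y == 1 else 1.0 - q

def expected_score(score, q, p):
    """Expected score of report q when the event truly occurs with probability p."""
    return p * score(q, 1) + (1.0 - p) * score(q, 0)

def best_report(score, p, grid):
    """Report on the grid that maximizes the expected score under truth p."""
    return max(grid, key=lambda q: expected_score(score, q, p))

# Assumed setup: true probability 0.7, candidate reports 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
p_true = 0.7

best_log = best_report(log_score, p_true, grid)
best_quad = best_report(quadratic_score, p_true, grid)
best_lin = best_report(linear_score, p_true, grid)

print("log score optimum:      ", best_log)
print("quadratic score optimum:", best_quad)
print("linear score optimum:   ", best_lin)
```

Under the two proper rules the best report coincides with the true probability, while the linear score rewards pushing the report toward the edge of the grid, which is the kind of hedging incentive that propriety rules out.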

If this is right

Proper scoring rules can serve as objective functions for estimating parameters in statistical models.
They provide a consistent basis for comparing the accuracy of different probabilistic forecasts.
Characterization theorems allow systematic construction of new scoring rules with specified properties.
Applications in machine learning can leverage these rules for training models that output probability distributions.
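The first two consequences above, estimation and forecast comparison, can be sketched together. This is a minimal illustration under stated assumptions, not the paper's method: it picks the continuous ranked probability score (CRPS) as the scoring rule, a Gaussian forecast family, synthetic data drawn from N(2, 1.5^2), and a plain grid search in place of a real optimizer. The CRPS is written here in its common negatively oriented form (lower is better), so the estimate minimizes the mean score. The names `crps_gaussian` and `mean_crps` are hypothetical helpers.

```python
import math
import random

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def mean_crps(mu, sigma, data):
    """Average CRPS of one fixed Gaussian forecast over all observations."""
    return sum(crps_gaussian(mu, sigma, y) for y in data) / len(data)

# Assumed synthetic data: 200 draws from N(2.0, 1.5^2) with a fixed seed.
rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(200)]

# Estimation: choose the parameters that minimize the mean score (grid search).
mu_grid = [i / 10 for i in range(0, 41)]            # 0.0, 0.1, ..., 4.0
sigma_grid = [0.5 + i / 10 for i in range(0, 26)]   # 0.5, 0.6, ..., 3.0
mu_hat, sigma_hat = min(
    ((m, s) for m in mu_grid for s in sigma_grid),
    key=lambda ms: mean_crps(ms[0], ms[1], data),
)

# Forecast comparison: rank competing forecasts by their mean score.
fitted_score = mean_crps(mu_hat, sigma_hat, data)
baseline_score = mean_crps(0.0, 1.0, data)  # a deliberately misspecified forecast

print("estimated (mu, sigma):", mu_hat, sigma_hat)
print("mean CRPS, fitted:    ", fitted_score)
print("mean CRPS, baseline:  ", baseline_score)
```

The same mean-score function serves both purposes: minimized over parameters it acts as an estimation objective, and evaluated at fixed forecasts it gives a common scale for comparing them.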

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reviewed foundations may support extensions to new application areas such as sequential decision problems.
Standardized use of proper scoring rules could improve comparability across different forecasting studies.
Further work might connect these rules to optimization techniques in high-dimensional settings.

Load-bearing premise

The cited literature on characterization results and families of scoring rules is accurately and comprehensively summarized without material omissions or misrepresentations of prior theorems.

What would settle it

Identification of a major characterization result or family of proper scoring rules omitted from the review would indicate incompleteness in the summary.

Original abstract

Proper scoring rules have been a subject of growing interest in recent years, not only as tools for evaluation of probabilistic forecasts but also as methods for estimating probability distributions. In this article, we review the mathematical foundations of proper scoring rules including general characterization results and important families of scoring rules. We discuss their role in statistics and machine learning for estimation and forecast evaluation. Furthermore, we comment on interesting developments of their usage in applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript reviews the mathematical foundations of proper scoring rules, covering general characterization results and important families of scoring rules. It discusses their applications in statistics and machine learning for estimation and forecast evaluation, and comments on recent developments in applications.

Significance. As a review consolidating established results on proper scoring rules and their use in estimation and evaluation, the paper could provide a helpful reference point for researchers in mathematical statistics and machine learning if the cited literature is represented accurately.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the assessment that the paper consolidates established results on proper scoring rules and may serve as a helpful reference for researchers in mathematical statistics and machine learning.

Circularity Check

0 steps flagged

Review paper summarizing established literature, with no new derivations or self-referential claims.

Full rationale

This manuscript is explicitly a review article whose purpose is to summarize mathematical foundations, characterization results, and families of proper scoring rules from the existing literature, along with their applications in statistics and machine learning. No new theorems, derivations, predictions, or empirical claims are asserted that could reduce to the paper's own inputs, fitted parameters, or self-citations by construction. The load-bearing content is accurate representation of cited prior work, which is independent of the present paper. No steps meet any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review paper, the work relies entirely on prior literature for its content and introduces no new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5584 in / 1030 out tokens · 21287 ms · 2026-05-22T21:56:42.022348+00:00

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
cs.AI · 2026-04 · unverdicted · novelty 6.0
BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.

Forecasting Commencing Enrolments Under Data Sparsity: A Zero-Shot Time Series Foundation Models Framework for Higher Education Planning
cs.AI · 2026-02 · unverdicted · novelty 6.0
Zero-shot TSFMs conditioned on leakage-safe covariates from Google Trends and an institutional index forecast commencing enrolments competitively with classical methods under data sparsity.

Multivariate Uncertainty Quantification with Tomographic Quantile Forests
cs.LG · 2025-12 · unverdicted · novelty 6.0
Tomographic Quantile Forests estimate multivariate conditional distributions nonparametrically by training one model on directional quantiles and reconstructing via sliced Wasserstein minimization.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 3 Pith papers · 1 internal anchor

