Evaluation Metrics as Averaged Outcomes of Fair Gambles

Rabanus Derr; Robert C. Williamson

arXiv: 2401.14483 · v4 · pith:YSFC2RUEnew · submitted 2024-01-25 · cs.LG · stat.ML


keywords: metrics, evaluation, fair, gambles, ability, averaged, forecasts, gambler


In current machine learning practice, the evaluation of forecasts has become a cornerstone of scientific progress. A multitude of evaluation metrics have been proposed and used to identify "good" forecasts. What do those metrics share? How are they related? In this work, we use a protocol borrowed from game-theoretic probability to show that a large class of evaluation metrics can be viewed as averaged outcomes of fair gambles. Intuitively, a fair gamble is one on which the forecaster would expect the gambler to fail. Hence, the gambler's ability to gain is evidence against the quality of the forecast. Standard evaluation metrics then correspond to different choices of such fair gambles. This choice is structured along two dimensions, one of which separates calibration-type from regret-type metrics. The framework also sheds light on the relationship between calibration and regret: when scaled appropriately, the two are theoretically equivalent in their ability to evaluate forecasts, but the scores they produce are not comparable.
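As a rough illustration of the "averaged outcome of a fair gamble" idea, the sketch below assumes binary outcomes and a gamble that pays `stake * (outcome - forecast)`. This payoff form, the function names, and the bin-based stake are assumptions chosen for illustration, not the paper's construction. Under the forecast's own probability the expected payoff is zero, which is the sense in which the gamble is fair; a stake that depends only on the forecast gives a calibration-flavoured score.

```python
def gamble_payoff(stake, forecast, outcome):
    """Payoff of a gamble on a binary outcome (assumed form, for illustration)."""
    return stake * (outcome - forecast)


def expected_payoff_under_forecast(stake, forecast):
    """Expected payoff if the outcome really were Bernoulli(forecast)."""
    return (forecast * gamble_payoff(stake, forecast, 1)
            + (1 - forecast) * gamble_payoff(stake, forecast, 0))


def bin_stake(low, high):
    """Hypothetical stake rule: bet one unit whenever the forecast lies in [low, high)."""
    return lambda p: 1.0 if low <= p < high else 0.0


def averaged_outcome(forecasts, outcomes, stake_fn):
    """Average payoff of the gamble over a sequence of forecast/outcome pairs."""
    payoffs = [gamble_payoff(stake_fn(p), p, y)
               for p, y in zip(forecasts, outcomes)]
    return sum(payoffs) / len(payoffs)


# Toy data: the forecaster says 0.8 four times, but the event occurs only once.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.2]
outcomes = [1, 0, 0, 0, 0]

fairness_check = expected_payoff_under_forecast(1.0, 0.8)
score = averaged_outcome(forecasts, outcomes, bin_stake(0.7, 0.9))
```

Here `fairness_check` is zero up to floating-point error, while `score` is clearly negative: a gambler taking the opposite side of this bet would have gained on average, which counts against the forecasts in that bin.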


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decision-Aligned Evaluation of Uncertainty Quantification
cs.LG · 2026-06 · unverdicted · novelty 6.0

Introduces decision-alignment to evaluate uncertainty metrics against downstream decision utilities and proposes prior-weighted proper scoring rules that align better in benchmarks and case studies.