Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation

Christoph Lehmann; Yahor Paromau

arxiv: 2501.16931 · v2 · submitted 2025-01-28 · 💻 cs.LG · stat.AP

Pith reviewed 2026-05-23 04:32 UTC · model grok-4.3

keywords: performance evaluation · distributional uncertainty · quantile estimation · nonparametric confidence intervals · machine learning · small sample sizes · model variability

The pith

Machine learning performance metrics can be treated as random variables whose quantiles are estimable with nonparametric confidence intervals from just 10-25 repeated trainings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard point estimates of accuracy or error hide the variability that comes from random data splits, weight initialization, and hyperparameter choices during training. The paper instead collects multiple performance scores from repeated trainings and examines their empirical distribution through quantiles together with nonparametric confidence intervals around those quantiles. It demonstrates that the resulting interval estimates remain usable and informative even when the number of repetitions is limited to the range of 10-25, a regime typical in practice because each training run is expensive. This distributional view supplies a risk-oriented reading of model behavior that point summaries cannot provide. The methods require no extra modeling assumptions beyond treating the observed scores as exchangeable draws from an underlying performance distribution.

Core claim

By viewing performance metrics as random quantities induced by stochastic training elements, the empirical distribution of scores obtained from repeated model trainings permits both point estimation and interval estimation of quantiles; standard nonparametric confidence intervals for these quantiles remain valid and yield useful inference at sample sizes between 10 and 25, thereby furnishing a finer-grained characterization of performance variability than mean-based evaluation alone.

What carries the argument

Empirical distribution of performance metrics from repeated trainings, with quantile estimation and nonparametric confidence intervals applied directly to the observed scores.
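
As an editorial illustration of this machinery, here is a minimal sketch of an order-statistic confidence interval for a quantile. It is not the authors' code. The function names `binom_cdf` and `quantile_ci`, the example scores, and the rule for choosing ranks (smallest rank gap, ties broken by higher coverage) are assumptions made for this sketch. It also assumes i.i.d. scores from a continuous distribution.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p); 0.0 when k < 0."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(min(k, n) + 1))

def quantile_ci(scores, p, conf=0.95):
    """Distribution-free CI for the p-quantile from order statistics.

    Returns (lower, upper, coverage), or None when no pair of order
    statistics reaches the requested confidence level.
    """
    x = sorted(scores)
    n = len(x)
    best = None  # (rank gap, -coverage, l, u), compared lexicographically
    for l in range(1, n):
        for u in range(l + 1, n + 1):
            # X_(l) <= q_p < X_(u) holds exactly when l <= B <= u - 1,
            # where B ~ Binomial(n, p) counts scores at or below q_p.
            cov = binom_cdf(u - 1, n, p) - binom_cdf(l - 1, n, p)
            if cov >= conf:
                cand = (u - l, -cov, l, u)
                if best is None or cand < best:
                    best = cand
    if best is None:
        return None
    _, neg_cov, l, u = best
    return x[l - 1], x[u - 1], -neg_cov

# Hypothetical accuracies from 10 repeated trainings (illustrative only).
accuracies = [0.912, 0.905, 0.921, 0.899, 0.917,
              0.908, 0.914, 0.903, 0.919, 0.910]
print(quantile_ci(accuracies, p=0.5))
```

With n = 10 and p = 0.5 the sketch selects the 2nd and 9th order statistics, whose nominal coverage is 1002/1024, about 0.979. For p = 0.1 and n = 10 it returns None, because even the interval from the minimum to the maximum covers the 10% quantile with probability of only about 0.65.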

If this is right

Model comparisons can rest on entire performance distributions rather than single averages.
Lower quantiles supply a direct measure of worst-case behavior relevant to reliability-critical uses.
Variability induced by data splitting, initialization, and optimization becomes visible without new computational overhead.
The same nonparametric intervals apply across classification and regression tasks without metric-specific adjustments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be used to define acceptance thresholds on lower quantiles when deploying models in regulated domains.
It suggests a natural way to budget the number of repetitions according to the desired quantile and interval width rather than a fixed rule of thumb.
Extending the same logic to other stochastic training pipelines, such as those involving data augmentation or continual learning, would require only repeated runs under the same protocol.
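
The budgeting idea can be made concrete with a small sketch. It is an editorial illustration, not something taken from the paper. It assumes i.i.d. scores from a continuous distribution and asks for the smallest number of runs at which the widest available interval (minimum to maximum), or the minimum alone as a lower bound, reaches the requested confidence. The function names and the `n_max` cap are hypothetical.

```python
def min_runs_two_sided(p, conf=0.95, n_max=10_000):
    """Smallest n for which [min, max] covers the p-quantile with
    probability >= conf, using coverage 1 - p**n - (1 - p)**n."""
    for n in range(2, n_max + 1):
        if 1 - p**n - (1 - p)**n >= conf:
            return n
    return None  # not reachable within n_max runs

def min_runs_lower_bound(p, conf=0.95, n_max=10_000):
    """Smallest n for which the sample minimum is a lower confidence
    bound for the p-quantile, using coverage 1 - (1 - p)**n."""
    for n in range(1, n_max + 1):
        if 1 - (1 - p)**n >= conf:
            return n
    return None

for p in (0.5, 0.25, 0.1):
    print(p, min_runs_two_sided(p), min_runs_lower_bound(p))
```

Under these assumptions a 95% two-sided interval needs at least 6 runs for the median, 11 for the 25% quantile, and 29 for the 10% quantile. So within the 10-25 range, which quantiles can be bounded at all depends on how far into the tail they lie.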

Load-bearing premise

Repeated trainings with different random seeds produce exchangeable samples from a single stable performance distribution.

What would settle it

If additional repeated trainings produce performance scores whose empirical distribution shifts systematically or whose order statistics deviate from exchangeability, the quantile estimates and their intervals would lose validity.
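
One simple way to probe for a systematic shift is a permutation test for drift along the run order. The sketch below is an editorial assumption, not a procedure from the paper. The function name, the test statistic (absolute covariance between run index and score), and the defaults are hypothetical choices. It only detects monotone drift and says nothing about multimodality or other forms of dependence.

```python
import random

def trend_permutation_test(scores, n_perm=5000, seed=0):
    """Permutation p-value for a monotone trend of scores over run order.

    Under exchangeability every ordering of the scores is equally likely,
    so the observed statistic should not be extreme among shuffles.
    """
    n = len(scores)
    mean_idx = (n - 1) / 2

    def stat(vals):
        mean_val = sum(vals) / n
        return abs(sum((i - mean_idx) * (v - mean_val)
                       for i, v in enumerate(vals)))

    observed = stat(scores)
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    perm = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if stat(perm) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical scores that drift upward with run order (illustrative only).
drifting = [0.80 + 0.01 * i for i in range(12)]
print(trend_permutation_test(drifting))
```

A small p-value suggests that scores depend on run order, which would be at odds with the exchangeability premise. A large p-value does not establish exchangeability.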

Original abstract

Machine learning models are often evaluated using point estimates of performance metrics such as accuracy, F1 score, or mean squared error. Such summaries fail to capture the inherent variability induced by stochastic elements of the training process, including data splitting, initialization, and hyperparameter optimization. This work proposes a distributional perspective on model evaluation by treating performance metrics as random quantities rather than fixed values. Instead of focusing solely on aggregate measures, empirical distributions of performance metrics are analyzed using quantiles and corresponding confidence intervals. The study investigates point and interval estimation of quantiles based on real-data use cases for classification and regression tasks, complemented by simulation studies for validation. Special emphasis is placed on small sample sizes, reflecting practical constraints in machine learning, where repeated training is computationally expensive. The results show that meaningful statistical inference on the underlying performance distribution is feasible even with sample sizes in the range of 10-25, while standard nonparametric confidence interval remain applicable under these conditions. The proposed approach provides a more detailed characterization of variability and uncertainty compared to mean-based evaluation and enables a more differentiated comparison of models. In particular, it supports a risk-oriented interpretation of model performance, which is relevant in applications where reliability is critical. The presented methods are easy to implement and broadly applicable, making them a practical extension to standard performance evaluation procedures in machine learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Applies textbook quantile intervals to ML performance metrics from repeated seeds but adds no new estimators and leans on an exchangeability assumption that may not hold.

The core message is that ML papers should report quantiles and nonparametric confidence intervals on performance instead of single means, and that this stays workable down to 10-25 repeated trainings. The authors run simulations and a few real-data cases to back that up.

Nothing in the statistics is new; they are using order statistics and standard distribution-free intervals that have been around for decades. The contribution is the reminder that this is feasible in the small-n regime common in deep learning. That part is useful because most labs already rerun models a handful of times but then collapse everything to a mean. The paper shows how to keep more of the information without extra cost.

The main soft spot is the exchangeability assumption. Different random seeds can land in different basins or interact with fixed data splits in ways that make the samples neither i.i.d. nor exchangeable, which would invalidate the coverage guarantees of the nonparametric intervals. The manuscript does not appear to test for multimodality or seed-dependent trends, so the claim that inference remains reliable rests on an unexamined modeling choice.

The work is incremental rather than foundational. Practitioners who already collect multiple runs will find the reporting suggestions straightforward to adopt. It is coherent on its own terms and engages the right literature on evaluation variability, so it clears the bar for peer review even though the statistical novelty is low.

Referee Report

1 major / 2 minor

Summary. The paper proposes shifting from point estimates to a distributional view of ML performance metrics (accuracy, F1, MSE, etc.), treating them as random variables induced by stochastic training elements. It advocates quantile estimation and standard nonparametric confidence intervals on the empirical performance distribution, with emphasis on feasibility for small sample sizes (n=10-25) from repeated trainings. Claims are supported by simulation studies and real-data experiments on classification and regression tasks; the approach is presented as easy to implement and enabling risk-oriented model comparison.

Significance. If the nonparametric intervals retain valid coverage, the work offers a low-overhead extension to standard evaluation that better captures variability and supports reliability-focused applications. Credit is due for targeting the practical constraint of expensive repeated training and for relying on distribution-free methods whose validity is external to the paper rather than introducing new fitted parameters.

major comments (1)

[Section 3 and simulation setup] The nonparametric quantile CIs are asserted to remain applicable for n=10-25 under the implicit assumption that performance values from different random seeds are exchangeable draws from a stable distribution. Non-convex optimization, shared data splits, and possible convergence to different basins can induce dependence or multimodality, which would invalidate the distribution-free coverage guarantees and thereby undermine the central feasibility claim.

minor comments (2)

[Abstract] The abstract states that simulation and real-data studies support the claims yet reports no quantitative results, coverage rates, or exclusion criteria, which weakens the reader's ability to gauge effect sizes even though the full manuscript presumably contains them.
[Abstract] Grammatical error in 'standard nonparametric confidence interval remain applicable' (should be 'intervals remain').

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments, which help clarify the assumptions underlying our proposed approach. We address the major comment below.

Referee: [Section 3 and simulation setup] The nonparametric quantile CIs are asserted to remain applicable for n=10-25 under the implicit assumption that performance values from different random seeds are exchangeable draws from a stable distribution. Non-convex optimization, shared data splits, and possible convergence to different basins can induce dependence or multimodality, which would invalidate the distribution-free coverage guarantees and thereby undermine the central feasibility claim.

Authors: We agree that the validity of the nonparametric quantile confidence intervals rests on the exchangeability of the performance metrics obtained from repeated trainings with different random seeds. Our manuscript implicitly relies on this standard assumption of i.i.d.-like sampling in stochastic optimization. In the simulation studies, data are generated under controlled exchangeable conditions, and the real-data experiments use distinct seeds on fixed datasets, which empirically produce exchangeable outcomes. While non-convexity or shared splits could in principle induce dependence or multimodality that violates coverage, our results show that the intervals maintain reasonable coverage for n=10-25 under the evaluated conditions. We will revise Section 3 to explicitly articulate the exchangeability assumption, note potential violations in edge cases, and add a brief discussion of robustness. This does not alter the central feasibility claim for typical ML evaluation settings.

Revision: partial

Circularity Check

0 steps flagged

No circularity; standard nonparametric methods applied directly to empirical performance samples

The paper's core procedure is to collect n=10-25 performance values from repeated trainings, form the empirical distribution, and apply textbook nonparametric quantile confidence intervals (order-statistic based). No equation defines a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to close the argument. The exchangeability assumption enters as a modeling premise rather than being derived from the data; the coverage claims rest on the external validity of classical nonparametric results, not on any reduction internal to the paper.
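
For reference, the classical result behind order-statistic intervals can be written out as follows. This is the standard textbook statement for an i.i.d. sample of size n from a continuous distribution with p-quantile q_p, added editorially; the notation is an assumption of this note and is not taken from the paper. Without continuity, the sum is a lower bound on the coverage of the closed interval rather than its exact value.

```latex
\[
P\bigl(X_{(l)} \le q_p < X_{(u)}\bigr)
  = \sum_{k=l}^{u-1} \binom{n}{k}\, p^{k} (1-p)^{n-k},
  \qquad 1 \le l < u \le n .
\]
% Worked instance: median, ten runs, 2nd and 9th order statistics.
\[
n = 10,\quad p = 0.5,\quad (l, u) = (2, 9):
\qquad
\sum_{k=2}^{8} \binom{10}{k}\, 2^{-10}
  = \frac{1002}{1024} \approx 0.979 .
\]
```

The right-hand side depends only on n, p, and the chosen ranks, not on the performance distribution. That is why the coverage claim is external to the paper, and why it holds only as long as the i.i.d. or exchangeability premise does.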

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper invokes standard nonparametric statistical assumptions for quantile estimation and confidence-interval construction without introducing new free parameters, axioms beyond those of classical statistics, or invented entities.

axioms (2)

domain assumption: Repeated trainings produce i.i.d. or exchangeable samples from a fixed performance distribution. Invoked when treating performance metrics as random quantities whose quantiles can be estimated from 10-25 repeats.
domain assumption: Standard nonparametric confidence intervals for quantiles remain valid at small sample sizes. Stated directly in the abstract as remaining applicable for n in 10-25.

pith-pipeline@v0.9.0 · 5765 in / 1367 out tokens · 21086 ms · 2026-05-23T04:32:26.658056+00:00 · methodology
