arxiv: 2605.01936 · v1 · submitted 2026-05-03 · 💻 cs.LG

Pandora's Regret: A Proper Scoring Rule for Evaluating Sequential Search

Gerardo A. Flores, Yash Deshpande, Jannis R. Brea, Ashia C. Wilson

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords proper scoring rules · sequential search · multiclass evaluation · rank reversal · expected search cost · calibration · medical imaging models

The pith

Sequential search costs induce Pandora's Regret, a closed-form strictly proper scoring rule that penalizes rank reversals where distractors outrank the true class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sequential search, in which alternatives are tested until the true class appears, creates a natural pairwise structure for scoring. From the expected cost of optimal search under a model's reported probabilities, the authors derive Pandora's Regret as a scoring rule that is minimized exactly when the model outputs true probabilities. Unlike log loss or accuracy, the new rule accounts for the full ranking and therefore punishes cases in which a wrong alternative is ranked above the true class. A one-parameter Beta family of such rules balances the penalty on rank swaps against errors in probability magnitude. Experiments on 597 MedMNIST models show that Pandora-based scores predict actual clinical diagnostic costs more accurately than standard alternatives.
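
To make the search-cost idea concrete, here is a minimal Python sketch under simplifying assumptions that are not taken from the paper: every test costs one unit, classes are tested in descending order of reported probability, and ties break by class index. The helper names `expected_tests` and `search_regret` are hypothetical. With fixed unit costs the regret depends only on the ranking, so this illustrates the rank-reversal penalty but is not itself the strictly proper closed form the paper derives.

```python
def expected_tests(report, truth):
    """Expected number of unit-cost tests when classes are tested in
    descending order of `report` and the true class is drawn from `truth`."""
    order = sorted(range(len(report)), key=lambda k: (-report[k], k))
    # The class at 0-based position i in the order is found after i + 1 tests.
    return sum(truth[k] * (i + 1) for i, k in enumerate(order))


def search_regret(report, truth):
    """Excess expected tests from searching by `report` instead of `truth`."""
    return expected_tests(report, truth) - expected_tests(truth, truth)


truth = [0.6, 0.3, 0.1]
same_rank = [0.5, 0.4, 0.1]   # magnitudes off, ranking intact
swapped = [0.3, 0.6, 0.1]     # top two classes reversed

print(search_regret(same_rank, truth))   # zero: the search order is unchanged
print(search_regret(swapped, truth))     # positive: a distractor is tested first
```

The report that keeps the ranking incurs no regret in this toy setting, while the swapped report pays for testing a distractor ahead of the most likely class. Randomizing the test costs, as the Beta family does, is what makes probability magnitudes matter as well.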

Core claim

Pandora's Regret is obtained by analyzing the expected cost of optimal sequential testing under varying per-test costs and subtracting the cost that would be incurred with true probabilities. The resulting expression is closed-form, pairwise additive, and strictly proper: the expected score is uniquely minimized by reporting the true probabilities, so any deviation increases it in expectation. It penalizes rank-reversing miscalibrations in addition to magnitude errors and yields a Beta family whose single parameter trades off the two kinds of penalty while retaining an interpretation as excess search cost.

What carries the argument

Pandora's Regret, the closed-form excess expected cost of optimal sequential search under the model's probabilities, which supplies both the strict properness and the pairwise additive structure.

If this is right

Log loss, accuracy, and macro-F1 each embed an implicit decision model that does not match the sequential-search utility.
Pandora-based metrics can be used to select or tune models when the downstream task is sequential testing.
The Beta family lets practitioners choose how heavily to penalize rank swaps versus probability magnitude while keeping a cost interpretation.
The construction extends the decision-theoretic approach to proper scoring rules from binary to multiclass sequential settings.
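
The trade-off controlled by the Beta parameter can be explored numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it takes unit base costs with independent Beta(α, 1) cost multipliers, charges the distractor's test cost whenever the distractor is tested before the true class, and normalizes so the loss equals 1 at r = 1, where r is the true-class probability divided by the distractor's. The closed form in `pairwise_loss` is a candidate derived from those assumptions (it is not quoted from the paper), and `pairwise_loss_mc` is a simulation used to sanity-check it. Both function names are hypothetical.

```python
import random


def pairwise_loss(r, alpha):
    """Candidate normalized pairwise loss, r = p_true / p_distractor.

    Assumption: derived for unit base costs with Beta(alpha, 1) multipliers
    and normalized so that the loss is 1 at r = 1.
    """
    if r > 1.0:
        return r ** -(alpha + 1.0)
    return ((2.0 * alpha + 1.0) - (alpha + 1.0) * r ** alpha) / alpha


def pairwise_loss_mc(r, alpha, n=200_000, seed=0):
    """Monte Carlo estimate of the same quantity from simulated costs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Inverse-CDF sampling: U ** (1 / alpha) is Beta(alpha, 1) for U uniform on (0, 1].
        u_true = (1.0 - rng.random()) ** (1.0 / alpha)
        u_dist = (1.0 - rng.random()) ** (1.0 / alpha)
        # The distractor is tested first when p_dist / u_dist > p_true / u_true,
        # i.e. when u_dist * r < u_true; its test cost is then wasted.
        if u_dist * r < u_true:
            total += u_dist
    # Divide by the exact r = 1 expectation to land on the normalized scale.
    norm = alpha ** 2 / ((alpha + 1.0) * (2.0 * alpha + 1.0))
    return total / n / norm


for alpha in (0.25, 1.0, 4.0):
    row = [round(pairwise_loss(r, alpha), 3) for r in (0.5, 1.0, 2.0)]
    print(alpha, row)
```

Under this candidate form, larger α pushes the loss toward a step at r = 1 (ranking dominates), while smaller α spreads the penalty over how far r sits from 1 (magnitude matters more).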

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar cost-based derivations may produce aligned scoring rules for other sequential decision problems such as adaptive testing or active learning.
In diagnostic pipelines the rule could be used directly as a training objective rather than only for post-hoc evaluation.
If the pairwise structure generalizes, the same method might yield proper rules for partial-information search settings where not all alternatives are tested.

Load-bearing premise

The expected cost of optimal search under the model's probabilities admits a pairwise decomposition whose closed form remains strictly proper for arbitrary testing-cost regimes.

What would settle it

A model that reports the true class probabilities yet incurs a higher average Pandora's Regret, beyond sampling noise, than a model that swaps the ranks of the true class and one distractor on the same test set.
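
A small simulation shows what such a check could look like. This is a sketch under assumptions, not the paper's evaluation code: test costs are drawn as independent Beta(α, 1) variables, the searcher tests classes in descending order of reported probability divided by cost, and the expectation over the true class is taken exactly for each cost draw. The function name `expected_search_cost` and the example probabilities are hypothetical.

```python
import random


def expected_search_cost(report, truth, alpha=1.0, n=20_000, seed=0):
    """Average cost of testing classes in descending report[k] / cost[k]
    until the true class (distributed as `truth`) is found."""
    rng = random.Random(seed)
    k = len(report)
    total = 0.0
    for _ in range(n):
        # Inverse-CDF sampling of Beta(alpha, 1); values lie in (0, 1].
        cost = [(1.0 - rng.random()) ** (1.0 / alpha) for _ in range(k)]
        order = sorted(range(k), key=lambda j: -report[j] / cost[j])
        spent = 0.0
        for j in order:
            spent += cost[j]            # pay for test j
            total += truth[j] * spent   # the search stops here if j is the true class
    return total / n


truth = [0.5, 0.3, 0.2]
reversed_report = [0.3, 0.5, 0.2]   # ranks of classes 0 and 1 swapped

honest = expected_search_cost(truth, truth)
swapped = expected_search_cost(reversed_report, truth)
print(honest, swapped, swapped - honest)
```

Both calls reuse the same seed, so they see identical cost draws and the difference isolates the effect of the report. Because ordering by probability over cost minimizes expected cost for each draw, the honest report should never come out worse in this setup; a reproducible counterexample of that kind on the paper's actual score is what the test above is looking for.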

Figures

Figures reproduced from arXiv: 2605.01936 by Ashia C. Wilson, Gerardo A. Flores, Jannis R. Brea, Yash Deshpande.

**Figure 2.** Pandora's Regret is the α = 1 unit-cost member of a one-parameter Beta(α, 1) family of pairwise-additive scores, strictly proper for every finite α > 0. As α → ∞, the score becomes increasingly rank-focused, and its limit depends only on the weighted ordering of the ratios $p_k/C_k$. As α ↓ 0, the pairwise loss tends to $1 - \ln r$ for $r \le 1$ and $r^{-1}$ for $r > 1$. When base costs $C_1, \dots, C_K > 0$ are available, th…

**Figure 3.** Confidence intervals for the gap $\Delta|\tau|$ between each metric and $S_{\text{Pandora}}$. Conditions (columns) are described in Section 5.2. The confidence interval for F1 on random temperature rescalings of OCTMNIST predictions overlaps zero; all others are bounded below zero.

**Figure 4.** Normalized pairwise loss $L_\alpha(r)$ as a function of the odds ratio $r = p_i/p_j$ for several values of α. All curves pass through $L_\alpha(1) = 1$. As α → ∞, the loss converges to a step function (pure rank sensitivity). As α → 0, the left branch becomes logarithmic, penalizing underconfident forecasts more broadly. Corollary A.7 (Weighted Beta score). Let $c_k = C_k u_k$ with $C_k > 0$ fixed and $u_k \overset{\text{iid}}{\sim} \mathrm{Beta}(\alpha, 1)$, where α > 0. …
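
The limiting behavior described in the Figure 2 and Figure 4 captions can be written out explicitly. The first display restates the α ↓ 0 limit from the caption, with the second branch read as the reciprocal $r^{-1}$, the reading consistent with $L_\alpha(1) = 1$. The second display is a candidate closed form for general α. It is an assumption, obtained here by taking unit base costs with iid Beta(α, 1) multipliers and normalizing at r = 1, with r the true-class probability over the distractor's; it is not quoted from the paper. The third display is the α → ∞ limit implied by that candidate, which matches the step-function behavior the caption describes.

```latex
% Limit stated in the Figure 2 caption:
\lim_{\alpha \downarrow 0} L_\alpha(r) =
\begin{cases}
1 - \ln r, & r \le 1,\\
r^{-1},    & r > 1.
\end{cases}

% Candidate closed form (assumption, not quoted from the paper):
L_\alpha(r) =
\begin{cases}
\dfrac{(2\alpha + 1) - (\alpha + 1)\, r^{\alpha}}{\alpha}, & r \le 1,\\[1.5ex]
r^{-(\alpha + 1)}, & r > 1,
\end{cases}
\qquad L_\alpha(1) = 1 .

% Limit of the candidate as alpha grows: a step at r = 1.
\lim_{\alpha \to \infty} L_\alpha(r) =
\begin{cases}
2, & r < 1,\\
1, & r = 1,\\
0, & r > 1.
\end{cases}
```

For the first branch, expanding $r^{\alpha} = 1 + \alpha \ln r + O(\alpha^2)$ gives $\bigl((2\alpha + 1) - (\alpha + 1)(1 + \alpha \ln r)\bigr)/\alpha \to 1 - \ln r$, which recovers the caption's logarithmic left branch.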

Original abstract

In sequential search, alternatives are tested until the true class is found. Standard proper scoring rules like log loss are local, ignoring the ranking of competitors and misaligning model evaluation with search utility. We show that sequential search induces a pairwise structure that overcomes this. By analyzing the expected cost of optimal search under varying testing costs, we derive Pandora's Regret: a closed-form, pairwise-additive, and strictly proper scoring rule. Pandora's Regret both elicits true probabilities and penalizes rank-reversing miscalibrations where distractors outrank the true class. Our construction yields a one-parameter Beta family that balances penalties for rank-swapping versus probability magnitude, while retaining a grounded interpretation as expected search cost. We prove that log loss, accuracy, and macro-F1 rely on implicit decision models misaligned with sequential search. Across 597 MedMNIST models, Pandora-based metrics better predict clinical diagnostic costs than standard alternatives, extending decision-theoretic scoring rule construction to the multiclass setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Pandora's Regret derives a proper scoring rule from sequential search costs that better tracks diagnostic expenses than log loss or accuracy, but the pairwise additivity may not hold cleanly when testing costs differ across classes.

The core contribution is a scoring rule extracted from the expected cost of optimal sequential testing in Pandora's problem. This produces a closed-form expression that is pairwise additive and strictly proper, so it penalizes both bad probability estimates and rank reversals where a wrong class gets checked first. The authors also show that log loss, accuracy, and macro-F1 implicitly assume decision models that do not match sequential search, which is a clean theoretical point. On the MedMNIST suite with 597 models the new metrics track actual clinical search costs more closely than the baselines, which is the practical payoff. The Beta family parameter lets users tilt the penalty between rank errors and magnitude errors while keeping the expected-cost interpretation. That combination of derivation and downstream correlation is the useful part.

The soft spot is the generality claim. The abstract derives the rule under varying testing costs, which reads as a claim for general cost regimes, yet the optimal stopping thresholds in Pandora's problem depend on the full probability vector. When costs are heterogeneous the decisions may not separate into independent pairwise terms, which would break the claimed additivity and the closed-form guarantee. The paper needs to spell out the steps that preserve the structure outside the uniform-cost case. The Beta choice also sits between derivation and tuning; more detail on how it is set would remove any sense of post-hoc adjustment.

This work is aimed at people who evaluate models for tasks where testing order carries real cost, such as medical diagnosis or ranked retrieval. Readers who already care about proper scoring rules or decision-theoretic evaluation will see the most value. The idea is novel enough and the empirical gap is large enough that it deserves a full referee process rather than a desk reject. I would send it out, with the main requests being a clearer derivation for non-uniform costs and a sensitivity check on the Beta parameter.

Referee Report

2 major / 1 minor

Summary. The manuscript derives Pandora's Regret as a closed-form, pairwise-additive, strictly proper scoring rule from the expected cost of optimal sequential search under Pandora's problem. It claims this rule elicits true probabilities, penalizes rank-reversing miscalibrations, and belongs to a one-parameter Beta family balancing rank-swap and magnitude penalties while retaining an expected-search-cost interpretation. Standard metrics (log loss, accuracy, macro-F1) are shown to rely on misaligned implicit decision models. Empirical results across 597 MedMNIST models indicate Pandora-based metrics better predict clinical diagnostic costs than alternatives.

Significance. If the derivation is sound, the work meaningfully extends decision-theoretic scoring-rule construction to multiclass sequential search, offering a utility-aligned alternative for applications such as medical diagnosis where ranking and search costs matter. The large-scale empirical comparison on 597 models provides concrete evidence of practical advantage and is a clear strength.

major comments (2)

[Abstract and §3 (derivation)] Abstract and theoretical derivation (Pandora's problem analysis): the central claim that the expected optimal search cost yields a closed-form pairwise-additive strictly proper rule that generalizes to any testing-cost regime is load-bearing. When class-specific costs are heterogeneous, optimal thresholds in Pandora's problem generally depend on the full probability vector in a non-separable manner; this risks breaking the claimed pairwise structure and additivity. Explicit closed-form expression and verification for non-uniform costs are required.
[§4 (Beta family)] Beta-family construction: the one-parameter Beta family is presented as both tunable and grounded in expected search cost, yet the parameter-selection procedure is not detailed. If the choice is post-hoc or data-dependent, it undermines the claim of a parameter-free derivation from first principles and the interpretation as expected cost.

minor comments (1)

[Empirical evaluation] Notation for the Beta parameter and the exact definition of 'clinical diagnostic costs' in the empirical section should be stated explicitly in the main text rather than deferred to supplements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments and for acknowledging the significance of our contribution. We address the major comments point by point below and have made revisions to strengthen the manuscript.

Referee: [Abstract and §3 (derivation)] Abstract and theoretical derivation (Pandora's problem analysis): the central claim that the expected optimal search cost yields a closed-form pairwise-additive strictly proper rule that generalizes to any testing-cost regime is load-bearing. When class-specific costs are heterogeneous, optimal thresholds in Pandora's problem generally depend on the full probability vector in a non-separable manner; this risks breaking the claimed pairwise structure and additivity. Explicit closed-form expression and verification for non-uniform costs are required.

Authors: We thank the referee for pointing out this critical aspect of the derivation. Our analysis in §3 starts with the general case of heterogeneous testing costs in Pandora's problem. Although optimal thresholds can depend on the probability vector, the expected optimal search cost regret decomposes into a sum of pairwise terms because the search continues until the true class is found, and the contribution of each misranked pair is independent in the cost accumulation. We will include the explicit closed-form expression for arbitrary costs in the revised manuscript and provide a mathematical verification of the additivity property to confirm the structure holds. revision: yes
Referee: [§4 (Beta family)] Beta-family construction: the one-parameter Beta family is presented as both tunable and grounded in expected search cost, yet the parameter-selection procedure is not detailed. If the choice is post-hoc or data-dependent, it undermines the claim of a parameter-free derivation from first principles and the interpretation as expected cost.

Authors: The Beta family parameter is not selected post-hoc but corresponds directly to the shape parameter α of the Beta(α, 1) distribution over testing costs in the Pandora formulation, providing a tunable balance while remaining grounded. We will revise §4 to include a detailed description of how the parameter is determined from the cost regime, including examples for different cost settings, to clarify that it does not undermine the first-principles derivation. revision: yes
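
The pairwise bookkeeping behind the first response can be checked on a single cost realization. The sketch below assumes that the search tests classes in descending order of reported probability divided by realized cost, with costs of the form c_k = C_k u_k; the base costs, the report, and the function names are hypothetical. It only verifies an accounting identity for that ordering rule. It does not by itself establish the closed form or strict properness for heterogeneous costs, which is what the referee asks for.

```python
import random


def cost_by_search(report, cost, true_class):
    """Total cost of testing classes in descending report[k] / cost[k]
    until `true_class` is reached (its own test is paid too)."""
    order = sorted(range(len(report)), key=lambda k: -report[k] / cost[k])
    spent = 0.0
    for k in order:
        spent += cost[k]
        if k == true_class:
            return spent
    raise ValueError("true_class is not a valid class index")


def cost_by_pairs(report, cost, true_class):
    """Same total written as the true class's own cost plus one term per
    distractor that the ordering places ahead of it (ties ignored)."""
    y = true_class
    ratio_y = report[y] / cost[y]
    return cost[y] + sum(
        cost[j]
        for j in range(len(report))
        if j != y and report[j] / cost[j] > ratio_y
    )


rng = random.Random(0)
base_cost = [1.0, 5.0, 0.5, 2.0]   # hypothetical heterogeneous base costs C_k
report = [0.4, 0.3, 0.2, 0.1]
# c_k = C_k * u_k with u_k uniform on (0, 1], i.e. Beta(1, 1) multipliers.
cost = [c * (1.0 - rng.random()) for c in base_cost]

for y in range(len(report)):
    print(y, cost_by_search(report, cost, y), cost_by_pairs(report, cost, y))
```

The two columns agree up to floating-point rounding for every true class, because each distractor contributes its cost exactly when its ratio exceeds that of the true class. Each term depends on only two reported probabilities, which is the pairwise structure the authors appeal to; whether its expectation stays closed-form and strictly proper for arbitrary C_k is the open request.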

Circularity Check

0 steps flagged

Derivation of Pandora's Regret from expected optimal search cost is independent and non-circular

The paper constructs Pandora's Regret by analyzing the expected cost of optimal sequential search under varying testing costs, drawing on the external decision-theoretic framework of Pandora's problem. This provides an independent grounding for the closed-form, pairwise-additive, and strictly proper properties rather than defining the rule in terms of itself or fitting parameters to the target evaluation metric. The one-parameter Beta family arises directly from the construction as a tunable balance between rank-swapping and magnitude penalties while preserving the expected-cost interpretation. No load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled via prior work are present. The derivation chain remains self-contained against external benchmarks of search-cost minimization and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim rests on the domain assumption that sequential search admits an optimal policy whose expected cost can be expressed in closed form as a pairwise additive function of the probability vector, plus one tunable parameter in the Beta family.

free parameters (1)

Beta family parameter
Single parameter controlling the relative penalty on rank-swapping versus probability magnitude; introduced to define the family of rules.

axioms (2)

domain assumption: Sequential search induces a pairwise structure on the scoring rule
Invoked when analyzing expected cost of optimal search under varying testing costs.
domain assumption: Optimal testing order is determined by the model's reported probabilities
Used to derive the regret quantity that becomes the scoring rule.


2023
[70]

Online Harmonizing Gradient Descent for Imbalanced Data Streams One-Pass Classification , url =

Zhou, Han and Yin, Hongpeng and Deng, Xuanhong and Huang, Yuyu , booktitle =. Online Harmonizing Gradient Descent for Imbalanced Data Streams One-Pass Classification , url =. 2023 , bdsk-url-1 =. doi:10.24963/ijcai.2023/274 , editor =

work page doi:10.24963/ijcai.2023/274 2023
[71]

Improving fairness in AI models on electronic health records: the case for federated learning methods

Kwegyir-Aggrey, Kweku and Gerchick, Marissa and Mohan, Malika and Horowitz, Aaron and Venkatasubramanian, Suresh , booktitle =. The Misuse of AUC: What High Impact Risk Assessment Gets Wrong , url =. 2023 , abstract =. doi:10.1145/3593013.3594100 , location =

work page doi:10.1145/3593013.3594100 2023
[72]

Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanation

Carrington, Andre M and Manuel, Douglas G and Fieguth, Paul W and Ramsay, Tim and Osmani, Venet and Wernly, Bernhard and Bennett, Carol and Hawken, Steven and Magwood, Olivia and Sheikh, Yusuf and McInnes, Matthew and Holzinger, Andreas , crdt =. Deep ROC Analysis and AUC as Balanced Average Accuracy, for Improved Classifier Selection, Audit and Explanati...

work page doi:10.1109/tpami.2022.3145392 2023
[73]

A Comparative Study of Assessment Metrics for Imbalanced Learning , year =

Farou, Zakarya and Aharrat, Mohamed and Horv. A Comparative Study of Assessment Metrics for Imbalanced Learning , year =. New Trends in Database and Information Systems , date =
[74]

Hand, D. J. and Anagnostopoulos, C. , date =. Notes on the H-measure of classifier performance , url =. Advances in Data Analysis and Classification , number =. 2023 , abstract =. doi:10.1007/s11634-021-00490-3 , id =

work page doi:10.1007/s11634-021-00490-3 2023
[75]

MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification , volume =

Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing , journal =. MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification , volume =
[76]

AdaCC: cumulative cost-sensitive boosting for imbalanced classification , url =

Iosifidis, Vasileios and Papadopoulos, Symeon and Rosenhahn, Bodo and Ntoutsi, Eirini , date =. AdaCC: cumulative cost-sensitive boosting for imbalanced classification , url =. Knowledge and Information Systems , number =. 2023 , abstract =. doi:10.1007/s10115-022-01780-8 , id =

work page doi:10.1007/s10115-022-01780-8 2023
[77]

From classification accuracy to proper scoring rules: elicitability of probabilistic top list predictions , volume =

Resin, Johannes , issn =. From classification accuracy to proper scoring rules: elicitability of probabilistic top list predictions , volume =. J. Mach. Learn. Res. , keywords =. 2023 , abstract =

2023
[78]

and Cranko, Zac , issn =

Williamson, Robert C. and Cranko, Zac , issn =. The geometry and calculus of losses , volume =. J. Mach. Learn. Res. , keywords =. 2023 , abstract =

2023
[79]

Ferrer, Analysis and comparison of classification met- rics, arXiv preprint arXiv:2209.05355 (2022)

Ferrer, Luciana , date-added =. Analysis and Comparison of Classification Metrics , year =. doi:10.48550/arXiv.2209.05355 , month =

work page doi:10.48550/arxiv.2209.05355
[80]

Frameworks and Results in Distributionally Robust Optimization , url =

Rahimian, Hamed and Mehrotra, Sanjay , doi =. Frameworks and Results in Distributionally Robust Optimization , url =. Open Journal of Mathematical Optimization , month = jul, pages =. 2022 , bdsk-url-1 =

2022

Showing first 80 references.