Density-Ratio Losses for Post-Hoc Learning to Defer

Alexander Soen; Joakim Jaldén; Ragnar Thobaben; Richard Nock

arxiv: 2605.19557 · v1 · pith:BA2UWQGUnew · submitted 2026-05-19 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-20 02:38 UTC · model grok-4.3

keywords: learning to defer · density ratio estimation · post-hoc methods · ideal distributions · class probability estimation · Chow's rule · expert-tilted posterior

The pith

Post-hoc learning to defer reduces to estimating the density ratio between a model's and an expert's ideal distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that deferral after a model is already trained can be handled by learning a scorer that estimates the density ratio between the ideal distribution for the model and the ideal distribution for the expert. Ideal distributions are divergence-regularized reweightings of the data distribution under which the model or the expert attains low loss. By reducing density-ratio estimation to class-probability estimation, the authors derive specific loss functions whose outputs can be thresholded to decide when to defer. A sympathetic reader would care because this approach permits changing the deferral rate with a simple threshold adjustment after training, recovers Chow's rule in the KL-based case, and links deferral to an expert-tilted Bayes posterior that accounts for the expert's performance.
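The reduction from density-ratio estimation to class-probability estimation can be illustrated on a toy problem. The sketch below is not the paper's DR CPE loss: it assumes two known 1-D Gaussians standing in for the two distributions, a logistic model, and the plain log loss. Samples from one distribution are labelled 1 and samples from the other 0 in equal numbers, so the class probability satisfies eta(x) = p(x) / (p(x) + q(x)) and the ratio is eta / (1 - eta). All names (`fit_logistic`, `estimated_ratio`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=300, lr=0.5):
    """Full-batch gradient descent on the log loss for eta(x) = sigmoid(w * x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy stand-ins (assumption): P = N(1, 1) gets label 1, Q = N(0, 1) gets label 0.
rng = random.Random(0)
n = 2000
xs = [rng.gauss(1.0, 1.0) for _ in range(n)] + [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1.0] * n + [0.0] * n

w, b = fit_logistic(xs, ys)


def estimated_ratio(x):
    """Density-ratio estimate p(x) / q(x) = eta(x) / (1 - eta(x))."""
    eta = sigmoid(w * x + b)
    return eta / (1.0 - eta)


# For these two Gaussians the true log ratio is x - 0.5,
# so the fitted (w, b) should land near (1.0, -0.5).
print(w, b, estimated_ratio(2.0))
```

With balanced classes the class prior drops out of the ratio; with unbalanced samples the estimate would need to be rescaled by the class proportions.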

Core claim

What carries the argument

The density ratio between a model's ideal distribution and an expert's ideal distribution, obtained via divergence-regularized reweightings and reduced to class-probability estimation losses for training a post-hoc deferral scorer.
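For the KL case, the passage quoted later on this page gives the reweighting as w(x;γ) = Z^{-1} exp(−E_η[ℓ]/γ). A minimal sketch of that exponential tilting on a finite sample follows. It assumes the per-instance expected losses are already available as plain numbers, normalises the weights to sum to one over the sample, and uses made-up loss values; `kl_ideal_weights` is a hypothetical helper, not a function from the paper.

```python
import math


def kl_ideal_weights(losses, gamma):
    """Weights proportional to exp(-loss / gamma), normalised to sum to one."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    shift = min(losses)  # subtracting the minimum avoids overflow and cancels in Z
    unnormalised = [math.exp(-(loss - shift) / gamma) for loss in losses]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]


# Made-up per-instance losses (assumption, for illustration only).
model_losses = [0.1, 0.2, 2.0, 0.3]
expert_losses = [0.5, 0.4, 0.1, 0.6]

w_model = kl_ideal_weights(model_losses, gamma=1.0)
w_expert = kl_ideal_weights(expert_losses, gamma=1.0)

# Ratio of the model's ideal weight to the expert's ideal weight per instance.
ratio = [wm / we for wm, we in zip(w_model, w_expert)]
print(ratio)
```

The ratio is smallest on the third instance, where the model's loss is high and the expert's is low, which is the kind of instance a low-ratio deferral rule would hand to the expert.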

Load-bearing premise

The reduction from density-ratio estimation to class-probability estimation holds for the chosen ideal distributions and the resulting scorer can be thresholded to produce valid deferral decisions without additional calibration or assumptions on the joint distribution of model and expert errors.

What would settle it

An experiment in which the DR CPE scorer's thresholded outputs fail to recover Chow's rule under the original distribution or fail to match the expert-tilted Bayes posterior for KL-based ideal distributions would falsify the central derivation.
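A check of this kind needs a reference implementation of Chow's rule to compare against. The sketch below uses the textbook form of the rule for 0-1 loss with a fixed rejection cost: reject when the largest class posterior is below 1 minus the cost. The posteriors and the scorer decisions are invented for illustration, and `agreement` is a hypothetical helper.

```python
def chow_reject(posterior, cost):
    """Chow's rule for 0-1 loss: reject when max posterior < 1 - cost."""
    if not 0.0 <= cost <= 1.0:
        raise ValueError("cost must lie in [0, 1]")
    return max(posterior) < 1.0 - cost


def agreement(decisions_a, decisions_b):
    """Fraction of instances on which two lists of reject decisions coincide."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("decision lists must have equal length")
    matches = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)


# Invented class posteriors for four instances (assumption).
posteriors = [
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
    [0.70, 0.20, 0.10],
    [0.85, 0.10, 0.05],
]
chow_decisions = [chow_reject(p, cost=0.2) for p in posteriors]

# Invented decisions standing in for a thresholded scorer (assumption).
scorer_decisions = [False, True, False, False]

print(chow_decisions, agreement(chow_decisions, scorer_decisions))
```

An agreement well below 1 on data where the true posteriors are known would be the kind of evidence the falsification test describes.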

Figures

Figures reproduced from arXiv: 2605.19557 by Alexander Soen, Joakim Jaldén, Ragnar Thobaben, Richard Nock.

**Figure 2.** Accuracy-deferral trade-off. Titles detail the dataset and ResNet utilized, […]

**Figure 3.** Accuracy-deferral trade-off plots over different […]

**Figure 4.** Accuracy-deferral trade-off plots over both accuracy and top- […]

**Figure 5.** Accuracy-deferral trade-off plots over different DR partial losses (with fixed […]

Original abstract

We study post-hoc Learning to Defer (L2D) through the lens of ideal distributions: divergence-regularized reweightings of the data distribution under which a model attains low loss. We define deferral via the density-ratio between a model's and an expert's ideals. Using the reduction from density-ratio estimation to class-probability estimation, we derive the DR CPE losses for post-hoc L2D scorers. Deferral decisions are then made by thresholding the scorer, allowing deferral rates to be adjusted without retraining. For KL-based ideal distributions, our deferral rules recover Chow's rule under the original distribution and a connection to an expert-tilted Bayes posterior -- which incorporates the expert's performance -- depending on if the ideal distributions are joint or marginal distributions. Experimentally, our approach is competitive compared to common baselines and more robust across dataset settings. More broadly, our results cast post-hoc L2D as density-ratio learning between ideal distributions, bridging Chow-style rules, expert comparison, and elucidating connections to related learning settings including anomaly detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper frames post-hoc deferral as density-ratio estimation between divergence-regularized ideal distributions and derives CPE losses that recover Chow's rule in the KL case.

The main thing to know is that the authors treat post-hoc learning to defer as density-ratio learning between two ideal distributions, one reweighted for the model and one for the expert. They apply the standard density-ratio to class-probability reduction to get new losses, then threshold the resulting scorer to control deferral rate without retraining. For KL ideals this recovers Chow's rule on the original distribution and links to an expert-tilted Bayes posterior depending on whether the ideals are joint or marginal.
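Adjusting the deferral rate without retraining amounts to picking a threshold from the scores of a held-out set. The sketch below assumes the scorer outputs are already computed and that an instance is deferred when its score is at or below the threshold, which is the direction suggested by the deferral passage quoted further down this page; that sign convention and the helper names are assumptions.

```python
import math


def threshold_for_rate(scores, target_rate):
    """Smallest observed score tau such that at least target_rate of scores are <= tau."""
    if not 0.0 <= target_rate <= 1.0:
        raise ValueError("target_rate must lie in [0, 1]")
    ordered = sorted(scores)
    # The small slack guards against floating-point overshoot in the product.
    k = math.ceil(target_rate * len(ordered) - 1e-9)
    if k <= 0:
        return -math.inf  # defer nothing
    return ordered[k - 1]


def defer(scores, tau):
    """Defer an instance when its score is at or below the threshold."""
    return [s <= tau for s in scores]


# Made-up held-out scores (assumption).
scores = [0.2, 1.5, 0.7, 3.0, 0.9, 0.4, 2.2, 1.1, 0.05, 1.8]

for rate in (0.1, 0.3, 0.5):
    tau = threshold_for_rate(scores, rate)
    achieved = sum(defer(scores, tau)) / len(scores)
    print(rate, tau, achieved)
```

The scorer is never refitted in this loop; only the threshold moves. With tied scores the achieved rate can exceed the target, so a real deployment would need a tie-breaking policy.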

Referee Report

2 major / 2 minor

Summary. The paper proposes framing post-hoc learning to defer (L2D) as density-ratio estimation between 'ideal' distributions, defined as divergence-regularized reweightings of the original data measure under which a model or expert attains low loss. It derives DR-CPE losses by reducing density-ratio estimation to class-probability estimation, obtains deferral scorers that can be thresholded to control deferral rate without retraining, and shows that the KL-based case recovers Chow's rule (under joint ideals) or connects to an expert-tilted Bayes posterior (under marginal ideals). Experiments are reported as competitive with baselines and more robust across settings.

Significance. If the central reduction and thresholding argument hold without additional joint-error assumptions, the work supplies a clean density-ratio perspective that unifies Chow-style rules with post-hoc L2D and links to anomaly detection. The post-hoc, thresholdable nature is practically attractive. The abstract reports no quantitative results, error bars, or dataset details, so the strength of the empirical claim remains to be verified against the full experiments.

major comments (2)

[§3] §3 (or the derivation following Eq. (3)): the claim that thresholding the DR-CPE scorer produces valid deferral sets appears to rest on the assumption that the class probability is monotonic in the conditional advantage of the expert over the model. The skeptic note correctly flags that marginal ideal distributions alone do not encode instance-level error correlation; the manuscript must explicitly state whether the joint versus marginal choice of ideals is sufficient to guarantee monotonicity, or whether an extra assumption on the joint distribution of model/expert errors is required.
[§4] The recovery of Chow's rule for KL-based joint ideals is stated in the abstract and presumably derived in §4. The derivation must be checked for circularity: if the reweighting factors that define the ideals are realized by a loss on the original data, the resulting ratio estimator must not implicitly presuppose the very deferral decision it is meant to produce. A concrete walk-through of the steps from the ideal densities to the thresholded rule would clarify this.

minor comments (2)

[Abstract] The abstract asserts that experiments are 'competitive and more robust' yet supplies no quantitative results, error bars, or dataset names. These details belong in the abstract or a prominent table in §5.
[Abstract] Notation for the ideal distributions (joint vs. marginal) should be introduced once and used consistently; the current abstract switches between them without a clear forward pointer to the section that defines both.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of major revision. We address each major comment point by point below and will revise the manuscript to incorporate the requested clarifications.

Referee: [§3] §3 (or the derivation following Eq. (3)): the claim that thresholding the DR-CPE scorer produces valid deferral sets appears to rest on the assumption that the class probability is monotonic in the conditional advantage of the expert over the model. The skeptic note correctly flags that marginal ideal distributions alone do not encode instance-level error correlation; the manuscript must explicitly state whether the joint versus marginal choice of ideals is sufficient to guarantee monotonicity, or whether an extra assumption on the joint distribution of model/expert errors is required.

Authors: We agree that explicit clarification is needed on this point. When joint ideal distributions are used, the density ratio is taken with respect to the joint measure over instances, which directly incorporates the per-instance losses of both the model and the expert. Consequently the class probability obtained from the DR-CPE reduction is monotonic in the conditional advantage of the expert, so that thresholding yields valid deferral sets without further assumptions. When marginal ideal distributions are used, instance-level error correlations are not encoded and monotonicity would indeed require an additional assumption on the joint distribution of model/expert errors. We will revise §3 to state this distinction clearly and to specify the conditions under which the thresholding argument holds.
Revision: yes

Referee: [§4] The recovery of Chow's rule for KL-based joint ideals is stated in the abstract and presumably derived in §4. The derivation must be checked for circularity: if the reweighting factors that define the ideals are realized by a loss on the original data, the resulting ratio estimator must not implicitly presuppose the very deferral decision it is meant to produce. A concrete walk-through of the steps from the ideal densities to the thresholded rule would clarify this.

Authors: We thank the referee for raising the possibility of circularity. The ideal distributions are purely theoretical objects: divergence-regularized reweightings of the original measure under which the model or expert attains low loss. The density-ratio estimator is obtained by applying the DR-CPE reduction directly to samples drawn from the original data distribution; the estimator is trained using only the observed per-instance losses of the model and expert and does not involve any deferral decisions or thresholds. The subsequent thresholding step is applied after estimation and does not feed back into the training of the scorer. We will add a concise, numbered walk-through of the steps from the definition of the ideal densities through the DR-CPE reduction to the final thresholded rule in the revised §4.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses external definitions and standard reductions

The paper starts from the external notion of ideal distributions (divergence-regularized reweightings of the data measure) and defines deferral explicitly as the density ratio between model and expert ideals. It then invokes the known, independently established reduction from density-ratio estimation to class-probability estimation to obtain the DR-CPE losses. Thresholding the resulting scorer is presented as a direct consequence of this construction. For the KL case the paper shows recovery of Chow's rule under the original distribution, which is an external benchmark rather than a self-derived quantity. No equation or step equates a fitted parameter to a prediction by construction, and no load-bearing premise rests solely on self-citation. The central claim therefore remains independent of its own outputs.
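The chain described above can be made concrete for the KL case. The following is an editorial sketch, not the paper's derivation: it takes the KL reweighting quoted elsewhere on this page, w(x;γ) = Z^{-1} exp(−E_η[ℓ]/γ), and assumes the expert's ideal has the same form with its own loss and its own regularization strength γ^e. The symbols L_m, L_e, Z_m and Z_e are shorthand introduced here.

```latex
% Assumed shorthand (not taken verbatim from the paper):
%   L_m(x), L_e(x): conditional expected losses of the model and the expert at x
%   Z_m, Z_e:       normalising constants of the two KL ideals
\begin{align*}
w_m(x;\gamma) &= Z_m^{-1}\exp\!\big(-L_m(x)/\gamma\big), &
w_e(x;\gamma^e) &= Z_e^{-1}\exp\!\big(-L_e(x)/\gamma^e\big), \\
\rho(x;\gamma,\gamma^e) &= \frac{w_m(x;\gamma)}{w_e(x;\gamma^e)}
  = \frac{Z_e}{Z_m}\exp\!\Big(\frac{L_e(x)}{\gamma^e}-\frac{L_m(x)}{\gamma}\Big), \\
\rho(x;\gamma,\gamma^e) \le \tau
  &\iff \frac{L_m(x)}{\gamma}-\frac{L_e(x)}{\gamma^e}
  \ge \log\frac{Z_e}{Z_m}-\log\tau .
\end{align*}
```

Under these assumptions the thresholded rule compares the scaled losses of the model and the expert, and neither the threshold nor a deferral decision appears in the definition of the weights. If the expert's loss is a constant, the expert's weight is identically 1 and the rule reduces to L_m(x) ≥ −γ log(Z_m τ), a threshold on the model's conditional expected loss, which for 0-1 loss has the shape of Chow's rule. Whether this matches the paper's exact statement should be checked against its §4.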

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of well-defined ideal distributions for both model and expert, the validity of the density-ratio-to-CPE reduction, and the assumption that thresholding the resulting scorer yields useful deferral without further calibration. No explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Ideal distributions exist and can be used to define a meaningful density ratio for deferral decisions.
Invoked when deferral is defined via the density-ratio between model's and expert's ideals.

standard math: The standard reduction from density-ratio estimation to class-probability estimation applies directly to the chosen ideal distributions.
Used to derive the DR CPE losses.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

ideal distribution Q_m ∈ arg min_Q E[ℓ] + γ D(Q∥P_x); KL yields w(x;γ) = Z^{-1} exp(−E_η[ℓ]/γ)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

deferral r_i(x) = ⟦ρ_i(x; γ, γ^e) ≤ τ⟧ with ρ_i = dQ_i/dQ_i^e

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 3 internal anchors

[1]

Methods of information geometry , volume 191

Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry , volume 191. American Mathematical Soc., 2000

work page 2000
[2]

Classification with a Reject Option using a Hinge Loss

Peter L Bartlett and Marten H Wegkamp. Classification with a Reject Option using a Hinge Loss. Journal of Machine Learning Research, 9 0 (8), 2008

work page 2008
[3]

Discriminative learning under covariate shift

Steffen Bickel, Michael Br \"u ckner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10 0 (9), 2009

work page 2009
[4]

Loss functions for binary class probability estimation and classification: Structure and applications

Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications . Working draft, November, 3: 0 13, 2005

work page 2005
[5]

How the machine `thinks': Understanding opacity in machine learning algorithms

Jenna Burrell. How the machine `thinks': Understanding opacity in machine learning algorithms . Big data & society, 3 0 (1): 0 2053951715622512, 2016

work page 2016
[6]

Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses

Yuzhou Cao, Tianchi Cai, Lei Feng, Lihong Gu, Jinjie Gu, Bo An, Gang Niu, and Masashi Sugiyama. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses . Advances in neural information processing systems, 35: 0 521--534, 2022

work page 2022
[7]

Anomaly detection: A survey

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey . ACM computing surveys (CSUR), 41 0 (3): 0 1--58, 2009

work page 2009
[8]

Classification with Rejection Based on Cost-sensitive Classification

Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with Rejection Based on Cost-sensitive Classification . In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 1507--1517. PMLR, 18--24 Jul 2021. URL...

work page 2021
[9]

A unifying post-processing framework for multi-objective learn-to-defer problems

Mohammad-Amin Charusaie and Samira Samadi. A unifying post-processing framework for multi-objective learn-to-defer problems . Advances in Neural Information Processing Systems, 37: 0 23705--23755, 2024

work page 2024
[10]

C. Chow. On optimum recognition error and reject tradeoff . IEEE Transactions on Information Theory, 16: 0 41--46, 1970. doi:10.1109/TIT.1970.1054406

work page doi:10.1109/tit.1970.1054406 1970
[11]

An optimum character recognition system using decision functions

Chi-Keung Chow. An optimum character recognition system using decision functions . IRE Transactions on Electronic Computers, EC-6 0 (4): 0 247--254, 1957. doi:10.1109/TEC.1957.5222035

work page doi:10.1109/tec.1957.5222035 1957
[12]

Learning with rejection

Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection . In International conference on algorithmic learning theory, pages 67--82. Springer, 2016

work page 2016
[13]

arXiv preprint arXiv:2510.26706 , year=

Giulia DeSalvo, Clara Mohri, Mehryar Mohri, and Yutao Zhong. Budgeted multiple-expert deferral . arXiv preprint arXiv:2510.26706, 2025

work page arXiv 2025
[14]

Dohan, W

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades . arXiv preprint arXiv:2207.10342, 2022

work page arXiv 2022
[15]

Statistics of robust optimization: A generalized empirical likelihood approach

John C Duchi, Peter W Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach . Mathematics of Operations Research, 46 0 (3): 0 946--969, 2021

work page 2021
[16]

A framework for robustness certification of smoothed classifiers using f-divergences

Krishnamurthy Dj Dvijotham, Jamie Hayes, Borja Balle, Zico Kolter, Chongli Qin, Andras Gyorgy, Kai Xiao, Sven Gowal, and Pushmeet Kohli. A framework for robustness certification of smoothed classifiers using f-divergences . In International Conference on Learning Representations, 2020

work page 2020
[17]

On the Foundations of Noise-free Selective Classification

Ran El-Yaniv et al. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research, 11 0 (5), 2010

work page 2010
[18]

On the probability function in the collective theory of risk

F Escher. On the probability function in the collective theory of risk . Skand. Aktuarie Tidskr., 15: 0 175--195, 1932

work page 1932
[19]

Dermatologist-level classification of skin cancer with deep neural networks

Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks . nature, 542 0 (7639): 0 115--118, 2017

work page 2017
[20]

Optimal strategies for reject option classifiers

Vojtech Franc, Daniel Prusa, and Vaclav Voracek. Optimal strategies for reject option classifiers . Journal of Machine Learning Research, 24 0 (11): 0 1--49, 2023

work page 2023
[21]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks . Advances in neural information processing systems, 30, 2017

work page 2017
[22]

S elective N et: A Deep Neural Network with an Integrated Reject Option

Yonatan Geifman and Ran El-Yaniv. S elective N et: A Deep Neural Network with an Integrated Reject Option . In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2151--2159. PMLR, 09--15 Jun 2019. URL https://proceedings.ml...

work page 2019
[23]

Language Model Cascades: Token-Level Uncertainty And Beyond

Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. Language Model Cascades: Token-Level Uncertainty And Beyond . In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KgaBScZ4VI

work page 2024
[24]

Classification with reject option

Radu Herbei and Marten H Wegkamp. Classification with reject option . The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pages 709--721, 2006

work page 2006
[25]

Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods

Eyke H \"u llermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods . Machine learning, 110 0 (3): 0 457--506, 2021

work page 2021
[26]

When does confidence-based cascade deferral suffice? Advances in Neural Information Processing Systems, 36: 0 9891--9906, 2023

Wittawat Jitkrittum, Neha Gupta, Aditya K Menon, Harikrishna Narasimhan, Ankit Rawat, and Sanjiv Kumar. When does confidence-based cascade deferral suffice? Advances in Neural Information Processing Systems, 36: 0 9891--9906, 2023

work page 2023
[27]

Skin cancer detection: Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images

Mohammad Ali Kadampur and Sulaiman Al Riyaee. Skin cancer detection: Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images . Informatics in Medicine Unlocked, 18: 0 100282, 2020

work page 2020
[28]

Efficient edge inference by selective query

Anil Kag, Igor Fedorov, Aditya Gangrade, Paul Whatmough, and Venkatesh Saligrama. Efficient edge inference by selective query . In The Eleventh International Conference on Learning Representations, 2022

work page 2022
[29]

A least-squares approach to direct importance estimation

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10: 0 1391--1445, 2009

work page 2009
[30]

BabyBear: Cheap inference triage for expensive language models

Leila Khalili, Yao You, and John Bohannon. BabyBear: Cheap inference triage for expensive language models . arXiv preprint arXiv:2205.11747, 2022. URL https://arxiv.org/abs/2205.11747

work page arXiv 2022
[31]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization . arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference . Journal of Machine Learning Research, 23 0 (132): 0 1--109, 2022

work page 2022
[33]

Two Notes on Notation

Donald E Knuth. Two Notes on Notation . The American Mathematical Monthly, 99: 0 403--422, 1992

work page 1992
[34]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images . Technical report, 2009

work page 2009
[35]

E. L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses . Springer International Publishing, 2005

work page 2005
[36]

Large language models in finance: A survey

Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey . In Proceedings of the fourth ACM international conference on AI in finance, pages 374--382, 2023

work page 2023
[37]

The Inductive Bias of Restricted f-GANs

Shuang Liu and Kamalika Chaudhuri. The inductive bias of restricted f-gans . arXiv preprint arXiv:1809.04542, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[38]

When more experts hurt: Underfitting in multi-expert learning to defer

Shuqi Liu, Yuzhou Cao, Lei Feng, Bo An, and Luke Ong. When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer . arXiv preprint arXiv:2602.17144, 2026

work page arXiv 2026
[39]

Segment anything in medical images

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images . Nature communications, 15 0 (1): 0 654, 2024

work page 2024
[40]

Predict responsibly: improving fairness and accuracy by learning to defer

David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer . Advances in neural information processing systems, 31, 2018

work page 2018
[41]

Tangobert: Reducing inference cost by using cascaded architecture

Jonathan Mamou, Oren Pereg, Moshe Wasserblat, and Roy Schwartz. Tangobert: Reducing inference cost by using cascaded architecture . arXiv preprint arXiv:2204.06271, 2022

work page arXiv 2022
[42]

Two-Stage Learning to Defer with Multiple Experts

Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-Stage Learning to Defer with Multiple Experts . Advances in Neural Information Processing Systems, 36: 0 3578--3606, 2023

work page 2023
[43]

Predictor-rejector multi-class abstention: Theoretical analysis and algorithms

Anqi Mao, Mehryar Mohri, and Yutao Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms . In International Conference on Algorithmic Learning Theory, pages 822--867. PMLR, 2024 a

work page 2024
[44]

Principled approaches for learning to defer with multiple experts

Anqi Mao, Mehryar Mohri, and Yutao Zhong. Principled approaches for learning to defer with multiple experts . In International Workshop on Combinatorial Image Analysis, pages 107--135. Springer, 2024 b

work page 2024
[45]

Mastering Multiple-Expert Routing: Realizable \ H\ -Consistency and Strong Guarantees for Learning to Defer

Anqi Mao, Mehryar Mohri, and Yutao Zhong. Mastering Multiple-Expert Routing: Realizable \ H\ -Consistency and Strong Guarantees for Learning to Defer . In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=2KlxjR6lsd

work page 2025
[46]

Linking losses for density ratio and class-probability estimation

Aditya Krishna Menon and Cheng Soon Ong. Linking losses for density ratio and class-probability estimation . In International Conference on Machine Learning, pages 304--313, 2016

work page 2016
[47]

A loss framework for calibrated anomaly detection

Aditya Krishna Menon and Robert C Williamson. A loss framework for calibrated anomaly detection . In Proceedings of the 32nd international conference on neural information processing systems, pages 1494--1504, 2018

work page 2018
[48]

Feynman-Kac Formulae

Pierre Del Moral. Feynman-Kac Formulae . Springer, 2004

work page 2004
[49]

Consistent estimators for learning to defer to an expert

Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert . In International conference on machine learning, pages 7076--7087. PMLR, 2020

work page 2020
[50]

Who should predict? exact algorithms for learning to defer to humans

Hussein Mozannar, Hunter Lang, Dennis Wei, Prasanna Sattigeri, Subhro Das, and David Sontag. Who should predict? exact algorithms for learning to defer to humans . In International conference on artificial intelligence and statistics, pages 10520--10545. PMLR, 2023

work page 2023
[51]

Learning to reject meets long-tail learning

Harikrishna Narasimhan, Aditya Krishna Menon, Wittawat Jitkrittum, Neha Gupta, and Sanjiv Kumar. Learning to reject meets long-tail learning . In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[52]

Faster Cascades via Speculative Decoding

Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, and Sanjiv Kumar. Faster Cascades via Speculative Decoding . In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=vo9t20wsmd

work page 2025
[53]

Jerzy Neyman and Egon S. Pearson. IX. On the problem of the most efficient tests of statistical hypotheses . Philosophical Transactions of the Royal Society of London Series A Containing Papers of a Mathematical or Physical Character, 231: 0 289--337, 1933. doi:10.1098/rsta.1933.0009

work page doi:10.1098/rsta.1933.0009 1933
[54]

On the calibration of multiclass classification with rejection

Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection . Advances in neural information processing systems, 32, 2019

work page 2019
[55]

A scaled Bregman theorem with applications

Richard Nock, Aditya Menon, and Cheng Soon Ong. A scaled Bregman theorem with applications . In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016 a

work page 2016
[56]

A scaled Bregman theorem with applications

Richard Nock, Aditya Menon, and Cheng Soon Ong. A scaled Bregman theorem with applications . Advances in Neural Information Processing Systems, 29, 2016 b

work page 2016
[57]

Differentiable learning under triage

Nastaran Okati, Abir De, and Manuel Gomez-Rodriguez. Differentiable learning under triage . Advances in Neural Information Processing Systems, 34: 0 9140--9151, 2021

work page 2021
[58]

Change of measure through the Legendre transform

Antoine Picard-Weibel and Benjamin Guedj. On change of measure inequalities for f -divergences . arXiv preprint arXiv:2202.05568, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[59]

AUC-based Selective Classification

Andrea Pugnana and Salvatore Ruggieri. AUC-based Selective Classification . In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2494--2514. PMLR, 25--27 Apr 2023 a . URL https://proceed...

work page 2023
[60]

A Model-Agnostic Heuristics for Selective Classification

Andrea Pugnana and Salvatore Ruggieri. A Model-Agnostic Heuristics for Selective Classification . Proceedings of the AAAI Conference on Artificial Intelligence, 37 0 (8): 0 9461--9469, Jun. 2023 b . doi:10.1609/aaai.v37i8.26133. URL https://ojs.aaai.org/index.php/AAAI/article/view/26133

work page doi:10.1609/aaai.v37i8.26133 2023
[61]

Consistent algorithms for multiclass classification with an abstain option

Harish G Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Consistent algorithms for multiclass classification with an abstain option . Electronic Journal of Statistics, 12: 0 530--554, 2018

work page 2018
[62]

Reid and Robert C

Mark D. Reid and Robert C. Williamson. Composite Binary Losses . Journal of Machine Learning Research, 11: 0 2387--2422, 2010

work page 2010
[63]

Information, Divergence and Risk for Binary Experiments

Mark D Reid and Robert C Williamson. Information, Divergence and Risk for Binary Experiments . Journal of Machine Learning Research, 12: 0 731--817, 2011

[64]

Pattern recognition and neural networks

Brian D Ripley. Pattern recognition and neural networks. Cambridge University Press, 2007

[65]

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970

[66]

Loss Functions and Operators Generated by f-Divergences

Vincent Roulet, Tianlin Liu, Nino Vieillard, Michael Eli Sander, and Mathieu Blondel. Loss Functions and Operators Generated by f-Divergences. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=V1YfPJDliw

[67]

A min-max solution of an inventory problem

Herbert E Scarf, KJ Arrow, and S Karlin. A min-max solution of an inventory problem. Technical report, Rand Corporation Santa Monica, 1957

[68]

Toward expert-level medical question answering with large language models

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025

[69]

A Connection Between Learning to Reject and Bhattacharyya Divergences

Alexander Soen. A Connection Between Learning to Reject and Bhattacharyya Divergences. In Geometric Science of Information, pages 369–377. Springer Nature Switzerland, 2026. doi:10.1007/978-3-032-03918-7_38

[70]

Rejection via Learning Density Ratios

Alexander Soen, Hisham Husain, Philip Schulz, and Vu Nguyen. Rejection via Learning Density Ratios. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[71]

A Classification Framework for Anomaly Detection

Ingo Steinwart, Don Hush, and Clint Scovel. A Classification Framework for Anomaly Detection. Journal of Machine Learning Research, 6(2), 2005

[72]

Direct importance estimation for covariate shift adaptation

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008. doi:10.1007/s10463-008-0197-x

[73]

Density ratio estimation in machine learning

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012

[74]

High-performance medicine: the convergence of human and artificial intelligence

Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 2019

[75]

Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems

Neeraj Varshney and Chitta Baral. Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11007–11021, Abu Dhabi, United Arab Emirates, December 2022. Association for Comput...

doi:10.18653/v1/2022.emnlp-main.756
[76]

Calibrated learning to defer with one-vs-all classifiers

Rajeev Verma and Eric Nalisnick. Calibrated learning to defer with one-vs-all classifiers. In International Conference on Machine Learning, pages 22184–22202. PMLR, 2022

[77]

Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles

Rajeev Verma, Daniel Barrejón, and Eric Nalisnick. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics, pages 11415–11434. PMLR, 2023

[78]

Rapid object detection using a boosted cascade of simple features

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I, 2001. doi:10.1109/CVPR.2001.990517

[79]

Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models

Xiaofang Wang, Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Movshovitz-Attias, and Elad Eban. Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=MvO2t0vbs4-

[80]

MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis

Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis. In IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195, 2021



