arxiv: 2605.19557 · v1 · pith:BA2UWQGUnew · submitted 2026-05-19 · 📊 stat.ML · cs.LG

Density-Ratio Losses for Post-Hoc Learning to Defer

Pith reviewed 2026-05-20 02:38 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords learning to defer · density ratio estimation · post-hoc methods · ideal distributions · class probability estimation · Chow's rule · expert-tilted posterior

The pith

Post-hoc learning to defer reduces to estimating the density ratio between a model's and an expert's ideal distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that deferral after a model is already trained can be handled by learning a scorer that estimates the density ratio between the ideal distribution for the model and the ideal distribution for the expert. Ideal distributions are reweighted versions of the data under which the model or expert attains low loss through divergence regularization. By reducing density-ratio estimation to class-probability estimation, the authors derive specific loss functions whose outputs can be thresholded to decide when to defer. A sympathetic reader would care because this approach permits changing the deferral rate with a simple threshold adjustment after training, recovers classical optimal rules such as Chow's for KL-based cases, and links deferral to an expert-tilted Bayes posterior that accounts for the expert's performance.

Core claim

We study post-hoc Learning to Defer (L2D) through the lens of ideal distributions: divergence-regularized reweightings of the data distribution under which a model attains low loss. We define deferral via the density ratio between a model's and an expert's ideals. Using the reduction from density-ratio estimation to class-probability estimation, we derive the DR CPE losses for post-hoc L2D scorers. Deferral decisions are then made by thresholding the scorer, allowing deferral rates to be adjusted without retraining. For KL-based ideal distributions, our deferral rules recover Chow's rule under the original distribution and a connection to an expert-tilted Bayes posterior -- which incorporat

What carries the argument

The density ratio between a model's ideal distribution and an expert's ideal distribution, obtained via divergence-regularized reweightings and reduced to class-probability estimation losses for training a post-hoc deferral scorer.
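The reduction from density-ratio estimation to class-probability estimation can be illustrated with a minimal sketch. It is not the paper's DR CPE loss: it uses plain logistic regression on two synthetic Gaussians, and the names `fit_logistic` and `log_ratio` are hypothetical. With equal sample sizes, a classifier that separates numerator samples (label 1) from denominator samples (label 0) has log-odds equal to the log density ratio, which for N(1, 1) against N(-1, 1) is exactly 2x.

```python
import numpy as np

def fit_logistic(x, y, steps=2000, lr=0.1):
    """Fit s(x) = w*x + b by gradient descent on the mean logistic loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # current estimate of P(y=1 | x)
        grad = p - y                            # gradient of the loss w.r.t. the score
        w -= lr * np.mean(grad * x)
        b -= lr * np.mean(grad)
    return w, b

rng = np.random.default_rng(0)
n = 5000
x_num = rng.normal(1.0, 1.0, n)    # samples from the numerator density, label 1
x_den = rng.normal(-1.0, 1.0, n)   # samples from the denominator density, label 0
x = np.concatenate([x_num, x_den])
y = np.concatenate([np.ones(n), np.zeros(n)])

w, b = fit_logistic(x, y)

def log_ratio(t):
    """Estimated log density ratio; the true value in this toy setup is 2*t."""
    return w * t + b
```

In this toy setting the fitted slope should land near 2 and the intercept near 0. The paper's scorer would be trained with its own losses against ideal distributions rather than against two observed samples.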

Load-bearing premise

The reduction from density-ratio estimation to class-probability estimation holds for the chosen ideal distributions and the resulting scorer can be thresholded to produce valid deferral decisions without additional calibration or assumptions on the joint distribution of model and expert errors.
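Adjusting the deferral rate by moving a threshold can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: the scores are random stand-ins for a trained scorer's held-out outputs, the helper names are hypothetical, and the convention that a low score means "defer to the expert" is assumed.

```python
import numpy as np

def threshold_for_rate(scores, deferral_rate):
    """Return a threshold tau so that about `deferral_rate` of `scores` lie below tau."""
    return float(np.quantile(scores, deferral_rate))

def defer_mask(scores, tau):
    # Assumed convention: a low score favours the expert, so defer when score < tau.
    return scores < tau

rng = np.random.default_rng(0)
scores = rng.random(1000)  # stand-in for held-out outputs of a trained scorer

tau_10 = threshold_for_rate(scores, 0.10)
tau_30 = threshold_for_rate(scores, 0.30)
defer_10 = defer_mask(scores, tau_10)
defer_30 = defer_mask(scores, tau_30)
```

Because only the threshold changes, the deferral sets are nested: every instance deferred at the 10% rate is also deferred at the 30% rate, and the scorer itself is never refit.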

What would settle it

An experiment in which the DR CPE scorer's thresholded outputs fail to recover Chow's rule under the original distribution or fail to match the expert-tilted Bayes posterior for KL-based ideal distributions would falsify the central derivation.
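For reference, the classical form of Chow's rule that such an experiment would compare against can be written in a few lines. This is a sketch of the textbook rule (reject when the expected 0-1 error of the Bayes prediction exceeds the rejection cost); the function name `chow_reject` and the toy posteriors are illustrative, and the paper's recovered rule may be parameterized differently.

```python
import numpy as np

def chow_reject(posterior, cost):
    """Reject where 1 - max_y P(y | x), the Bayes prediction's error, exceeds `cost`."""
    return 1.0 - posterior.max(axis=1) > cost

# Rows are instances, columns are class posteriors P(y | x).
posterior = np.array([
    [0.9, 0.1],   # confident: keep the model's prediction
    [0.6, 0.4],   # error 0.4 > 0.3: reject
    [0.5, 0.5],   # error 0.5 > 0.3: reject
])
reject = chow_reject(posterior, cost=0.3)
```

Here `reject` is `[False, True, True]`: only the first instance is confident enough to be kept at a rejection cost of 0.3.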

Figures

Figures reproduced from arXiv: 2605.19557 by Alexander Soen, Joakim Jaldén, Ragnar Thobaben, Richard Nock.

Figure 1: Post-hoc L2D to learnable density-ratio (DR) deferral. Starting from the post-hoc L2D objective in …
Figure 2: Accuracy-deferral trade-off. Titles detail the dataset and ResNet utilized, …
Figure 3: Accuracy-deferral trade-off plots over different …
Figure 4: Accuracy-deferral trade-off plots over both accuracy and top- …
Figure 5: Accuracy-deferral trade-off plots over different DR partial losses (with fixed …

Original abstract

We study post-hoc Learning to Defer (L2D) through the lens of ideal distributions: divergence-regularized reweightings of the data distribution under which a model attains low loss. We define deferral via the density ratio between a model's and an expert's ideals. Using the reduction from density-ratio estimation to class-probability estimation, we derive the DR CPE losses for post-hoc L2D scorers. Deferral decisions are then made by thresholding the scorer, allowing deferral rates to be adjusted without retraining. For KL-based ideal distributions, our deferral rules recover Chow's rule under the original distribution and a connection to an expert-tilted Bayes posterior -- which incorporates the expert's performance -- depending on whether the ideal distributions are joint or marginal distributions. Experimentally, our approach is competitive compared to common baselines and more robust across dataset settings. More broadly, our results cast post-hoc L2D as density-ratio learning between ideal distributions, bridging Chow-style rules, expert comparison, and elucidating connections to related learning settings including anomaly detection.
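A worked equation may help make "KL-based ideal distribution" concrete. The form below is an assumption for illustration, not the paper's definition: it takes the ideal to be the minimizer of expected loss plus a KL penalty to the data distribution $P$, with an assumed regularization weight $\lambda$ and assumed losses $\ell_m$ (model) and $\ell_e$ (expert). Under that assumption the ideal is the standard exponential (Gibbs) tilt of $P$.

```latex
% Assumed form of a KL-regularized ideal distribution (notation is illustrative):
Q^\star_\ell = \arg\min_{Q \ll P} \; \mathbb{E}_{Q}[\ell(Z)] + \lambda \, \mathrm{KL}(Q \,\|\, P),
\quad \lambda > 0
\qquad\Longrightarrow\qquad
\frac{\mathrm{d}Q^\star_\ell}{\mathrm{d}P}(z)
  = \frac{\exp(-\ell(z)/\lambda)}{\mathbb{E}_{P}[\exp(-\ell(Z)/\lambda)]}.

% Ratio between the model's ideal (loss \ell_m) and the expert's ideal (loss \ell_e):
\frac{\mathrm{d}Q^\star_{\ell_m}}{\mathrm{d}Q^\star_{\ell_e}}(z)
  = \frac{\mathbb{E}_{P}[\exp(-\ell_e(Z)/\lambda)]}{\mathbb{E}_{P}[\exp(-\ell_m(Z)/\lambda)]}
    \, \exp\!\left(-\frac{\ell_m(z) - \ell_e(z)}{\lambda}\right).
```

Under these assumptions the ratio is a decreasing function of $\ell_m(z) - \ell_e(z)$, so deferring where the ratio falls below a threshold is the same as deferring where the model's loss exceeds the expert's by more than a fixed margin.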

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes framing post-hoc learning to defer (L2D) as density-ratio estimation between 'ideal' distributions, defined as divergence-regularized reweightings of the original data measure under which a model or expert attains low loss. It derives DR-CPE losses by reducing density-ratio estimation to class-probability estimation, obtains deferral scorers that can be thresholded to control deferral rate without retraining, and shows that the KL-based case recovers Chow's rule (under joint ideals) or connects to an expert-tilted Bayes posterior (under marginal ideals). Experiments are reported as competitive with baselines and more robust across settings.

Significance. If the central reduction and thresholding argument hold without additional joint-error assumptions, the work supplies a clean density-ratio perspective that unifies Chow-style rules with post-hoc L2D and links to anomaly detection. The post-hoc, thresholdable nature is practically attractive. The manuscript does not yet provide quantitative tables, error bars, or dataset details in the abstract, so the strength of the empirical claim remains to be verified from the full experiments.

major comments (2)
  1. [§3] §3 (or the derivation following Eq. (3)): the claim that thresholding the DR-CPE scorer produces valid deferral sets appears to rest on the assumption that the class probability is monotonic in the conditional advantage of the expert over the model. The skeptic note correctly flags that marginal ideal distributions alone do not encode instance-level error correlation; the manuscript must explicitly state whether the joint versus marginal choice of ideals is sufficient to guarantee monotonicity, or whether an extra assumption on the joint distribution of model/expert errors is required.
  2. [§4] The recovery of Chow's rule for KL-based joint ideals is stated in the abstract and presumably derived in §4. The derivation must be checked for circularity: if the reweighting factors that define the ideals are realized by a loss on the original data, the resulting ratio estimator must not implicitly presuppose the very deferral decision it is meant to produce. A concrete walk-through of the steps from the ideal densities to the thresholded rule would clarify this.
minor comments (2)
  1. [Abstract] The abstract asserts that experiments are 'competitive and more robust' yet supplies no quantitative results, error bars, or dataset names. These details belong in the abstract or a prominent table in §5.
  2. [Abstract] Notation for the ideal distributions (joint vs. marginal) should be introduced once and used consistently; the current abstract switches between them without a clear forward pointer to the section that defines both.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of major revision. We address each major comment point by point below and will revise the manuscript to incorporate the requested clarifications.

  1. Referee: [§3] §3 (or the derivation following Eq. (3)): the claim that thresholding the DR-CPE scorer produces valid deferral sets appears to rest on the assumption that the class probability is monotonic in the conditional advantage of the expert over the model. The skeptic note correctly flags that marginal ideal distributions alone do not encode instance-level error correlation; the manuscript must explicitly state whether the joint versus marginal choice of ideals is sufficient to guarantee monotonicity, or whether an extra assumption on the joint distribution of model/expert errors is required.

    Authors: We agree that explicit clarification is needed on this point. When joint ideal distributions are used, the density ratio is taken with respect to the joint measure over instances, which directly incorporates the per-instance losses of both the model and the expert. Consequently the class probability obtained from the DR-CPE reduction is monotonic in the conditional advantage of the expert, so that thresholding yields valid deferral sets without further assumptions. When marginal ideal distributions are used, instance-level error correlations are not encoded and monotonicity would indeed require an additional assumption on the joint distribution of model/expert errors. We will revise §3 to state this distinction clearly and to specify the conditions under which the thresholding argument holds. revision: yes

  2. Referee: [§4] The recovery of Chow's rule for KL-based joint ideals is stated in the abstract and presumably derived in §4. The derivation must be checked for circularity: if the reweighting factors that define the ideals are realized by a loss on the original data, the resulting ratio estimator must not implicitly presuppose the very deferral decision it is meant to produce. A concrete walk-through of the steps from the ideal densities to the thresholded rule would clarify this.

    Authors: We thank the referee for raising the possibility of circularity. The ideal distributions are purely theoretical objects: divergence-regularized reweightings of the original measure under which the model or expert attains low loss. The density-ratio estimator is obtained by applying the DR-CPE reduction directly to samples drawn from the original data distribution; the estimator is trained using only the observed per-instance losses of the model and expert and does not involve any deferral decisions or thresholds. The subsequent thresholding step is applied after estimation and does not feed back into the training of the scorer. We will add a concise, numbered walk-through of the steps from the definition of the ideal densities through the DR-CPE reduction to the final thresholded rule in the revised §4. revision: yes
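The monotonicity point in the first response can be illustrated with a small sketch. It shows only the generic fact that a class probability $\eta$ and the ratio $\eta / (1 - \eta)$ induce the same thresholded sets, because the map between them is strictly increasing. The data are random stand-ins, and the "defer when the score is low" convention is assumed rather than taken from the paper.

```python
import numpy as np

def ratio_from_probability(eta):
    """Map a class probability in (0, 1) to the corresponding odds / density ratio."""
    return eta / (1.0 - eta)

rng = np.random.default_rng(0)
eta = rng.uniform(0.01, 0.99, size=1000)  # stand-in class-probability estimates

t = 0.3  # threshold on the probability scale
defer_by_probability = eta < t
defer_by_ratio = ratio_from_probability(eta) < ratio_from_probability(t)

# A strictly increasing transform preserves the ordering, hence the deferral set.
same_sets = bool(np.array_equal(defer_by_probability, defer_by_ratio))
```

Whether the estimated probability is itself monotonic in the expert's conditional advantage is the separate question the referee raises; this sketch does not address it.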

Circularity Check

0 steps flagged

No significant circularity; derivation uses external definitions and standard reductions


The paper starts from the external notion of ideal distributions (divergence-regularized reweightings of the data measure) and defines deferral explicitly as the density ratio between model and expert ideals. It then invokes the known, independently established reduction from density-ratio estimation to class-probability estimation to obtain the DR-CPE losses. Thresholding the resulting scorer is presented as a direct consequence of this construction. For the KL case the paper shows recovery of Chow's rule under the original distribution, which is an external benchmark rather than a self-derived quantity. No equation or step equates a fitted parameter to a prediction by construction, and no load-bearing premise rests solely on self-citation. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of well-defined ideal distributions for both model and expert, the validity of the density-ratio-to-CPE reduction, and the assumption that thresholding the resulting scorer yields useful deferral without further calibration. No explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption: Ideal distributions exist and can be used to define a meaningful density ratio for deferral decisions.
    Invoked when deferral is defined via the density ratio between the model's and the expert's ideals.
  • standard math: The standard reduction from density-ratio estimation to class-probability estimation applies directly to the chosen ideal distributions.
    Used to derive the DR CPE losses.

pith-pipeline@v0.9.0 · 5724 in / 1680 out tokens · 37157 ms · 2026-05-20T02:38:52.614854+00:00 · methodology
