pith. machine review for the scientific record.

arxiv: 2604.09414 · v2 · submitted 2026-04-10 · 📊 stat.ML · cs.LG

Recognition: unknown

Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords multi-expert learning to defer · surrogate loss · decoupled estimation · expert redundancy · gradient pathology · H-consistency · classification deferral

The pith

A decoupled surrogate for multi-expert deferral separates class posteriors from expert utilities to eliminate gradient pathologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing surrogates for learning when to defer to one of many experts treat classes and experts as joint actions inside a single augmented space. This produces statistical consistency at the population level but creates training-time problems: gradients can amplify errors when experts overlap, starve rare specialists, or couple decisions so that adding experts degrades overall performance. The paper proposes instead to estimate class probabilities with a standard softmax while modeling each expert's utility with its own independent sigmoid. This split yields an H-consistency bound whose constant does not grow with the number of experts and produces gradients free of the earlier amplification, starvation, and coupling effects. On synthetic benchmarks, CIFAR-10, CIFAR-10H, and Covertype, the decoupled method is the only one that avoids these failures while still beating a standalone classifier in every setting.

Core claim

The paper claims that the root cause of underfitting, expert suppression, and degradation with pool size in current multi-expert deferral methods is the joint augmented-action geometry, and that replacing it with a decoupled surrogate (softmax for class posteriors, independent sigmoids for per-expert utilities) removes amplification under redundancy, preserves rare specialists, and delivers an H-consistency guarantee whose leading constant is independent of the expert count J when the per-expert weight is held fixed at β = λ/J.

What carries the argument

The decoupled surrogate that estimates the class posterior via softmax while treating each expert's deferral utility with a separate sigmoid.
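The structure of that surrogate can be made concrete with a short sketch. The function name, argument shapes, and exact binary cross-entropy form below are our assumptions for illustration, not the paper's code; the point is only that the class head and each expert head contribute separate loss terms, so no gradient flows between them.

```python
import numpy as np

def decoupled_surrogate_loss(class_logits, expert_logits, y, expert_correct, beta):
    """Hypothetical sketch of a decoupled surrogate: softmax over classes,
    one independent sigmoid per expert (names/signature are illustrative)."""
    # Class head: standard softmax cross-entropy on the true label y.
    z = class_logits - class_logits.max()          # stabilize the softmax
    class_loss = -(z[y] - np.log(np.exp(z).sum()))

    # Expert heads: each utility is its own sigmoid, so the gradient for
    # expert j never rescales, or couples with, any other expert's head.
    p = 1.0 / (1.0 + np.exp(-expert_logits))
    bce = -(expert_correct * np.log(p) + (1.0 - expert_correct) * np.log(1.0 - p))

    # With beta = lambda / J, total deferral weight stays fixed as J grows.
    return class_loss + beta * bce.sum()
```

Because the expert terms are an additive sum of per-head losses, adding an expert adds a new term without reshaping the gradients of existing heads, which is the mechanism the paper credits for avoiding the coupling pathology.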

If this is right

  • The H-consistency bound constant stays fixed even as the expert pool size J increases, provided the per-expert weight is set to β = λ/J.
  • Rare specialists remain active rather than being starved when many redundant experts are added.
  • Performance improves over a standalone classifier in every tested regime instead of degrading.
  • Training gradients no longer amplify errors under expert redundancy or couple class and deferral decisions.
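The first bullet's β = λ/J scaling can be checked numerically: with J independent sigmoid terms each weighted β, the total deferral weight is βJ = λ, so redundant copies of an expert do not inflate the objective. A toy check (the loss form and values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def weighted_expert_term(expert_logits, expert_correct, beta):
    """Sum of independent per-expert sigmoid losses, weighted by beta
    (an illustrative stand-in for the paper's deferral term)."""
    p = 1.0 / (1.0 + np.exp(-expert_logits))
    bce = -(expert_correct * np.log(p) + (1.0 - expert_correct) * np.log(1.0 - p))
    return beta * bce.sum()

lam = 0.5
terms = []
for J in (1, 4, 16, 64):
    # J fully redundant experts: identical logits, identical outcomes.
    logits = np.full(J, 1.0)
    correct = np.ones(J)
    terms.append(weighted_expert_term(logits, correct, beta=lam / J))
# With beta = lam / J, the deferral term is identical for every pool size.
```

Under a fixed weight β (not rescaled by J), the same loop would grow the deferral term linearly in J, which is the scaling behavior the bullet says the λ/J choice prevents.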

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic expert pools could be grown or pruned without retraining the entire system to prevent suppression of useful members.
  • The same separation of posterior and utility heads might stabilize training in other multi-agent selection settings where both consistency and gradient balance are required.
  • If expert correlations turn out to be stronger than those in the tested benchmarks, an extra regularization term on the sigmoids may still be needed to keep the independence assumption valid.

Load-bearing premise

That separating class-posterior estimation from per-expert utility estimation removes the amplification, starvation, and coupling pathologies without introducing new failure modes when real experts have correlated errors.

What would settle it

An experiment on a dataset where experts share highly correlated error patterns in which the decoupled surrogate begins to suppress a useful specialist or stops improving over the base classifier.
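One way to build such a stress test is to simulate a pool whose mistakes are driven by a shared latent "hard example" flag, so error patterns are nearly identical across experts. The protocol below is a hypothetical sketch of ours (names and parameters are not from the paper), not the authors' experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_experts(n, J, base_err=0.2, rho=0.9):
    """Simulate J experts whose errors co-occur: with probability rho an
    expert fails exactly on the shared 'hard' examples, otherwise it
    fails independently at the same base rate (illustrative protocol)."""
    hard = rng.random(n) < base_err           # shared failure events
    wrong = np.empty((J, n), dtype=bool)
    for j in range(J):
        follow = rng.random(n) < rho          # copy the shared pattern...
        own = rng.random(n) < base_err        # ...or fail independently
        wrong[j] = np.where(follow, hard, own)
    return wrong

errs = correlated_experts(10_000, J=5)
# Pairwise error correlation is far above the independent-experts rate.
```

Training the decoupled surrogate on labels and expert outcomes generated this way, then checking whether a lone specialist (low rho, different error support) stays active, would directly probe the load-bearing premise above.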

read the original abstract

Existing multi-expert learning-to-defer surrogates are statistically consistent, yet they can underfit, suppress useful experts, or degrade as the expert pool grows. We trace these failures to a shared architectural choice: casting classes and experts as actions inside one augmented prediction geometry. Consistency governs the population target; it says nothing about how the surrogate distributes gradient mass during training. We analyze five surrogates along both axes and show that each trades a fix on one for a failure on the other. We then introduce a decoupled surrogate that estimates the class posterior with a softmax and each expert utility with an independent sigmoid. It admits an $\mathcal{H}$-consistency bound whose constant is $J$-independent for fixed per-expert weight $\beta{=}\lambda/J$, and its gradients are free of the amplification, starvation, and coupling pathologies of the augmented family. Experiments on synthetic benchmarks, CIFAR-10, CIFAR-10H, and Covertype confirm that the decoupled surrogate is the only method that avoids amplification under redundancy, preserves rare specialists, and consistently improves over a standalone classifier across all settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that existing augmented-action surrogates for multi-expert learning-to-defer suffer from gradient pathologies (amplification under redundancy, starvation of rare specialists, and class-expert coupling) despite statistical consistency, because they embed classes and experts in a single augmented action space. It introduces a decoupled surrogate that estimates the class posterior via softmax and each expert's utility via an independent sigmoid; this admits an H-consistency bound whose constant is independent of the number of experts J when the per-expert weight is fixed at β=λ/J, and whose gradients avoid the listed pathologies. Experiments on synthetic data, CIFAR-10, CIFAR-10H, and Covertype show that the decoupled method is the only one that avoids amplification, preserves specialists, and consistently beats a standalone classifier.

Significance. If the H-consistency result in Section 3 and the gradient analysis in Section 4 are correct, the work supplies both a population-level guarantee and an optimization-level fix for scaling deferral systems to heterogeneous expert pools. The empirical demonstration that only the decoupled surrogate improves over the base classifier across all tested regimes (including redundancy and rarity) is a concrete practical contribution. The combination of a J-independent bound and pathology-free gradients addresses a gap between consistency theory and training dynamics that prior surrogates left open.

minor comments (2)
  1. The abstract and introduction refer to an analysis of five augmented surrogates; the main text should list their explicit forms (loss functions and architectures) in Section 2 so readers can directly compare the gradient pathologies discussed in Section 4.
  2. Section 5 (experiments) reports consistent gains but omits the precise data splits, expert simulation protocol, and hyperparameter ranges used for each baseline; adding these would strengthen reproducibility without altering the central claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of the theoretical contribution of the J-independent H-consistency bound, and the recommendation for minor revision. We are pleased that the practical value of the decoupled surrogate in avoiding gradient pathologies is acknowledged.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core derivation in Section 3 establishes an H-consistency bound for the decoupled surrogate whose constant is independent of J under the explicit design choice β=λ/J; this scaling is stated as a fixed parameter choice rather than a fitted or self-defined quantity. Gradient analysis in Section 4 identifies amplification/starvation/coupling issues in augmented-action surrogates via direct examination of the loss gradients, without reducing to a post-hoc fit or renaming of known results. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are present in the central claims. The separation of softmax class posteriors from independent sigmoid expert utilities is a structural choice whose population guarantees are derived independently of the experimental outcomes, making the overall chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the specific design choice of independent sigmoid heads; no new entities are postulated.

free parameters (1)
  • per-expert weight β
    Fixed at λ/J to obtain the J-independent bound; this scaling is chosen by the authors rather than derived from data.
axioms (1)
  • domain assumption The surrogate loss is H-consistent under standard multiclass assumptions
    The consistency bound is stated to hold; the abstract does not derive the base consistency from first principles.

pith-pipeline@v0.9.0 · 5502 in / 1423 out tokens · 56108 ms · 2026-05-10T16:36:12.150652+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 4 canonical work pages

  1. [1]

    Multi-class H-consistency bounds

    Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Multi-class H-consistency bounds. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

  2. [2]

    Convexity, classification, and risk bounds

    Peter Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006. doi:10.1198/016214505000000907

  3. [3]

    Classification with a reject option using a hinge loss

    Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. The Journal of Machine Learning Research, 9:1823–1840, June 2008

  4. [4]

    Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses

    Yuzhou Cao, Tianchi Cai, Lei Feng, Lihong Gu, Jinjie Gu, Bo An, Gang Niu, and Masashi Sugiyama. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. Advances in neural information processing systems, 35:521–534, 2022

  5. [5]

    In defense of softmax parametrization for calibrated and consistent learning to defer

    Yuzhou Cao, Hussein Mozannar, Lei Feng, Hongxin Wei, and Bo An. In defense of softmax parametrization for calibrated and consistent learning to defer. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2024. Curran Associates Inc

  6. [6]

    Sample efficient learning of predictors that complement humans, 2022

    Mohammad-Amin Charusaie, Hussein Mozannar, David Sontag, and Samira Samadi. Sample efficient learning of predictors that complement humans, 2022

  7. [7]

    On optimum recognition error and reject tradeoff

    C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, January 1970. doi:10.1109/TIT.1970.1054406

  8. [8]

    Learning with rejection

    Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In Ronald Ortner, Hans Ulrich Simon, and Sandra Zilles, editors, Algorithmic Learning Theory, pages 67--82, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46379-7

  9. [9]

    Cardinality-aware set prediction and top-k classification

    Corinna Cortes, Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Cardinality-aware set prediction and top-k classification. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=WAT3qu737X

  10. [10]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/4a8423d5e91...

  11. [11]

    Robust loss functions under label noise for deep neural networks

    Aritra Ghosh, Himanshu Kumar, and P Shanti Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017

  12. [12]

    Mitigating underfitting in learning to defer with consistent losses

    Shuqi Liu, Yuzhou Cao, Qiaozhen Zhang, Lei Feng, and Bo An. Mitigating underfitting in learning to defer with consistent losses. In International Conference on Artificial Intelligence and Statistics, pages 4816--4824. PMLR, 2024

  13. [13]

    When more experts hurt: Underfitting in multi-expert learning to defer

    Shuqi Liu, Yuzhou Cao, Lei Feng, Bo An, and Luke Ong. When more experts hurt: Underfitting in multi-expert learning to defer. arXiv preprint arXiv:2602.17144, 2026

  14. [14]

    Predict responsibly: improving fairness and accuracy by learning to defer

    David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in neural information processing systems, 31, 2018

  15. [15]

    Cross-entropy loss functions: Theoretical analysis and applications

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International conference on Machine learning, pages 23803--23828. PMLR, 2023

  16. [16]

    Predictor-rejector multi-class abstention: Theoretical analysis and algorithms

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. In International Conference on Algorithmic Learning Theory, pages 822–867. PMLR, 2024a

  17. [17]

    H-consistency bounds: Characterization and extensions

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. H-consistency bounds: Characterization and extensions. Advances in Neural Information Processing Systems, 36, 2024b

  18. [18]

    Principled approaches for learning to defer with multiple experts

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Principled approaches for learning to defer with multiple experts. In ISAIM, 2024c

  19. [19]

    Mastering multiple-expert routing: Realizable H-consistency and strong guarantees for learning to defer

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Mastering multiple-expert routing: Realizable H-consistency and strong guarantees for learning to defer. In Forty-second International Conference on Machine Learning, 2025

  20. [20]

    Foundations of machine learning

    Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012

  21. [21]

    A two-stage learning-to-defer approach for multi-task learning

    Yannis Montreuil, Yeo Shu Heng, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. A two-stage learning-to-defer approach for multi-task learning. In Forty-second International Conference on Machine Learning, 2025

  22. [22]

    Why ask one when you can ask k? Learning-to-defer to the top-k experts

    Yannis Montreuil, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. Why ask one when you can ask k? Learning-to-defer to the top-k experts. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=mGbEv4kVoG

  23. [23]

    Consistent estimators for learning to defer to an expert

    Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020

  24. [24]

    Who should predict? Exact algorithms for learning to defer to humans

    Hussein Mozannar, Hunter Lang, Dennis Wei, Prasanna Sattigeri, Subhro Das, and David A. Sontag. Who should predict? Exact algorithms for learning to defer to humans. In International Conference on Artificial Intelligence and Statistics, 2023. URL https://api.semanticscholar.org/CorpusID:255941521

  25. [25]

    Post-hoc estimators for learning to defer to an expert

    Harikrishna Narasimhan, Wittawat Jitkrittum, Aditya K Menon, Ankit Rawat, and Sanjiv Kumar. Post-hoc estimators for learning to defer to an expert. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 29292--29304. Curran Associates, Inc., 2022. URL https://proce...

  26. [26]

    On the calibration of multiclass classification with rejection

    Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection. Advances in neural information processing systems, 32, 2019

  27. [27]

    Human uncertainty makes classification more robust

    Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617--9626, 2019

  28. [28]

    How to compare different loss functions and their risks

    Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26:225–287, 2007. URL https://api.semanticscholar.org/CorpusID:16660598

  29. [29]

    Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles

    Rajeev Verma, Daniel Barrejon, and Eric Nalisnick. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics, 2022. URL https://api.semanticscholar.org/CorpusID:253237048

  30. [30]

    Statistical behavior and consistency of classification methods based on convex risk minimization

    Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32, 2004. doi:10.1214/aos/1079120130

  31. [31]

    Generalized cross entropy loss for training deep neural networks with noisy labels

    Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31, 2018