Adversarial Robustness in One-Stage Learning-to-Defer
Pith reviewed 2026-05-21 20:47 UTC · model grok-4.3
The pith
A new framework secures one-stage learning-to-defer against adversarial attacks on both predictions and deferral decisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first framework for adversarial robustness in one-stage L2D, covering both classification and regression. Our approach formalizes attacks, proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including H, (R, F), and Bayes consistency. Experiments on benchmark datasets confirm that our methods improve robustness against untargeted and targeted attacks while preserving clean performance.
What carries the argument
Cost-sensitive adversarial surrogate losses that jointly optimize the predictor and deferral rule under formal attack models.
If this is right
- Robustness to untargeted and targeted attacks improves in one-stage L2D without degrading clean accuracy.
- The same loss construction applies to both classification and regression deferral problems.
- Theoretical guarantees cover H-consistency, (R, F)-consistency, and Bayes consistency.
- The framework closes the gap left by prior two-stage robustness analyses.
Where Pith is reading between the lines
- The same cost-sensitive construction might extend to settings with multiple experts or sequential deferral decisions.
- Real-world hybrid systems that route safety-critical inputs could adopt the joint-training recipe to limit attack surface.
- Future work could test whether the surrogate losses remain effective when the attack budget varies across different input regions.
Load-bearing premise
The cost-sensitive adversarial surrogate losses can be jointly optimized in the one-stage setting to achieve the stated consistency guarantees.
What would settle it
An explicit counter-example input distribution where the proposed surrogate losses produce a deferral rule that is neither H-consistent nor (R,F)-consistent under the formalized attack model.
Original abstract
Learning-to-Defer (L2D) enables hybrid decision-making by routing inputs either to a predictor or to external experts. While promising, L2D is highly vulnerable to adversarial perturbations, which can not only flip predictions but also manipulate deferral decisions. Prior robustness analyses focus solely on two-stage settings, leaving open the end-to-end (one-stage) case where predictor and allocation are trained jointly. We introduce the first framework for adversarial robustness in one-stage L2D, covering both classification and regression. Our approach formalizes attacks, proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including $\mathcal{H}$, $(\mathcal{R}, \mathcal{F})$, and Bayes consistency. Experiments on benchmark datasets confirm that our methods improve robustness against untargeted and targeted attacks while preserving clean performance.
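For orientation, the objects the abstract refers to can be sketched in the notation commonly used in the L2D literature. This is an illustrative reading, not the paper's exact definitions: the number of classes $n$, the number of experts $J$, the consultation costs $c_j$, the rule $h$ over the augmented action set $\{1, \dots, n+J\}$, and the choice to evaluate costs at the clean input are all assumptions. Only the perturbation ball $B_p(x, \gamma)$ follows the paper's appendix notation.

```latex
% Clean deferral loss (assumed standard form): misclassification cost when
% the model predicts itself, consultation cost c_j when it defers to expert j.
\ell_{\mathrm{def}}(h, x, y)
  = \mathbb{1}[h(x) \neq y]\,\mathbb{1}[h(x) \le n]
  + \sum_{j=1}^{J} c_j(x, y)\,\mathbb{1}[h(x) = n + j]

% Adversarial counterpart: worst case over the perturbation ball around x.
\widetilde{\ell}_{\mathrm{def}}(h, x, y)
  = \sup_{x' \in B_p(x, \gamma)}
    \Big( \mathbb{1}[h(x') \neq y]\,\mathbb{1}[h(x') \le n]
    + \sum_{j=1}^{J} c_j(x, y)\,\mathbb{1}[h(x') = n + j] \Big),
\qquad
B_p(x, \gamma) = \{\, x' : \lVert x' - x \rVert_p \le \gamma \,\}
```

Under this reading, an attack can raise the loss in two ways: by flipping the predicted class, or by pushing the input toward a costly deferral action. A surrogate loss would replace the indicators with a differentiable upper bound.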
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the first framework for adversarial robustness in one-stage Learning-to-Defer (L2D), covering both classification and regression. It formalizes attacks on the joint predictor-deferral decisions, proposes cost-sensitive adversarial surrogate losses, and claims theoretical guarantees of H-consistency, (R, F)-consistency, and Bayes consistency. Experiments on benchmark datasets are reported to show improved robustness to untargeted and targeted attacks while preserving clean performance.
Significance. If the consistency guarantees are shown to hold under joint one-stage optimization, the work would establish a foundational approach for robust end-to-end L2D systems, extending prior two-stage analyses and providing practical surrogate losses for hybrid decision-making under adversarial conditions.
Major comments (2)
- [§4.2, Theorem 3] (H-consistency): the proof appears to extend the two-stage surrogate-loss calibration directly to the joint setting, but the one-stage formulation couples the predictor and deferral parameters through a shared network and single loss; without an explicit re-derivation showing that the adversarial perturbation set and cost matrix preserve the required fixed-point property under joint gradients, the guarantee does not automatically transfer.
- [§4.3] (R, F)-consistency claim: the argument relies on the cost-sensitive adversarial loss maintaining Bayes consistency when optimized jointly, yet the manuscript provides no separate analysis of how the perturbation ball interacts with the coupled objective; this is load-bearing for the overall theoretical contribution.
Minor comments (2)
- [§5.1] The description of the attack generation procedure (PGD steps, epsilon values) could be expanded with explicit pseudocode or parameter tables for reproducibility.
- [Table 2] The clean vs. adversarial accuracy columns would benefit from standard-error bars or multiple random seeds to support the reported improvements.
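The Table 2 comment asks for standard errors across random seeds. A minimal sketch of that summary follows; the function name `mean_and_stderr` and the accuracy values are hypothetical placeholders, not results from the paper.

```python
import math
import statistics


def mean_and_stderr(values):
    """Return (mean, standard error of the mean) for per-seed metrics.

    Uses the sample standard deviation (n - 1 in the denominator),
    so at least two seeds are required.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean, stderr


# Hypothetical per-seed adversarial accuracies (illustrative only).
adv_acc = [0.80, 0.82, 0.84]
mean, stderr = mean_and_stderr(adv_acc)
print(f"adversarial accuracy: {mean:.3f} +/- {stderr:.3f}")
```

Reporting the same summary for the clean and adversarial columns would show whether the gap between methods exceeds seed-to-seed variation.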
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on the theoretical contributions of our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of the consistency results without altering the core claims.
- Referee: [§4.2, Theorem 3] (H-consistency): the proof appears to extend the two-stage surrogate-loss calibration directly to the joint setting, but the one-stage formulation couples the predictor and deferral parameters through a shared network and single loss; without an explicit re-derivation showing that the adversarial perturbation set and cost matrix preserve the required fixed-point property under joint gradients, the guarantee does not automatically transfer.
  Authors: We appreciate the referee's careful scrutiny of the proof strategy. Theorem 3 establishes H-consistency with respect to the joint hypothesis class that encompasses both the predictor and deferral functions under simultaneous optimization. The adversarial perturbation set is defined over the combined output space of predictions and deferral decisions, and the cost matrix enters the surrogate loss in a manner that preserves the calibration property for the joint objective. Nevertheless, we agree that an explicit re-derivation would improve clarity and rigor. In the revised manuscript we will insert a dedicated supporting lemma immediately preceding Theorem 3 that re-derives the fixed-point property under joint gradient flow, explicitly accounting for the shared network parameters and the interaction between the perturbation ball and the cost-sensitive loss.
  Revision: yes
- Referee: [§4.3] (R, F)-consistency claim: the argument relies on the cost-sensitive adversarial loss maintaining Bayes consistency when optimized jointly, yet the manuscript provides no separate analysis of how the perturbation ball interacts with the coupled objective; this is load-bearing for the overall theoretical contribution.
  Authors: We thank the referee for identifying this point. The (R, F)-consistency argument proceeds by showing that any minimizer of the joint adversarial surrogate loss yields the Bayes-optimal combined decision rule under the given cost structure. The perturbation ball is incorporated by taking the supremum over perturbations inside the ball for each input, which is already reflected in the definition of the adversarial risk. We acknowledge, however, that a more granular analysis of how the radius of the ball couples with the shared parameters would make the load-bearing step fully transparent. In the revision we will add a short subsection (or appendix paragraph) that isolates this interaction, deriving an explicit bound on the consistency gap in terms of the perturbation radius and the joint optimization.
  Revision: yes
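The supremum over the perturbation ball discussed above is usually approximated in practice by projected gradient descent (PGD), the attack procedure the referee's §5.1 comment asks to see spelled out. The following is a minimal sketch under stated assumptions, not the paper's procedure: it assumes a linear model whose logits range over an augmented action set (class labels first, then one deferral action per expert), an L-infinity ball, and a cross-entropy objective. The names `pgd_linf`, `W`, `b`, and all numeric values are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def pgd_linf(W, b, x, label, eps, step, n_steps, target=None):
    """L-infinity PGD against a linear model with logits W @ x + b.

    Rows of W index an augmented action set (assumed layout): class
    labels first, then one "defer to expert j" action per expert.
    Untargeted (target=None): ascend the cross-entropy of `label`.
    Targeted: descend the cross-entropy of `target`, for example a
    deferral action the attacker wants to force.
    """
    x_adv = x.copy()
    idx = label if target is None else target
    direction = 1.0 if target is None else -1.0
    for _ in range(n_steps):
        p = softmax(W @ x_adv + b)
        onehot = np.zeros_like(p)
        onehot[idx] = 1.0
        # Gradient of cross-entropy(idx) with respect to the input x.
        grad = W.T @ (p - onehot)
        x_adv = x_adv + direction * step * np.sign(grad)
        # Project back onto the L-infinity ball of radius eps around x.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv


# Toy setup (illustrative only): 2 classes plus 1 deferral action, 2 features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([1.0, 0.2])  # clean input, classified as action 0

x_unt = pgd_linf(W, b, x, label=0, eps=1.0, step=0.1, n_steps=20)
x_tgt = pgd_linf(W, b, x, label=0, eps=1.0, step=0.1, n_steps=20, target=2)
print("untargeted action:", int(np.argmax(W @ x_unt + b)))
print("targeted action:  ", int(np.argmax(W @ x_tgt + b)))
```

The targeted branch illustrates the deferral-manipulation threat described in the abstract: the attacker steers the input toward a chosen deferral action rather than merely away from the correct class. A parameter table for `eps`, `step`, and `n_steps` would cover the reproducibility gap the referee raises.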
Circularity Check
No circularity: new one-stage framework and consistency claims derived independently
The paper introduces a novel framework for adversarial robustness in one-stage L2D, formalizes attacks on both classification and regression, proposes cost-sensitive adversarial surrogate losses, and establishes H, (R,F), and Bayes consistency guarantees. No quoted equations or sections reduce these guarantees by construction to fitted parameters, internal definitions, or unverified self-citations; the one-stage joint optimization is presented as a direct extension with its own theoretical analysis rather than a renaming or load-bearing reuse of prior two-stage results. The derivation chain remains self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including H, (R,F), and Bayes consistency"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "one-stage L2D where predictor and allocation are trained jointly"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.