Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer

Axel Carlier; Lai Xing Ng; Wei Tsang Ooi; Yannis Montreuil

arxiv: 2604.09414 · v5 · pith:OGZLHNURnew · submitted 2026-04-10 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-21 09:33 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords learning to defer · multi-expert systems · surrogate losses · excess risk bounds · decoupled optimization · calibration constants

The pith

A decoupled surrogate for multi-expert learning-to-defer yields an excess-risk bound whose constant stays fixed as the expert pool grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard augmented-action training for systems that can defer to any of several experts creates optimization problems such as gradient amplification and starvation even when the method is statistically consistent. It replaces the single shared geometry with a decoupled surrogate that uses one softmax head over the classes and a separate sigmoid head for each expert. Because the two heads are independent, per-sample updates become coordinatewise and the cross-Hessian block between classes and experts is exactly zero. The authors then derive an excess-risk bound whose leading calibration constant is the maximum of two fixed numbers that do not increase with the number of experts, provided the weight per expert is held constant. If the bound holds, practitioners can add more specialists without the theoretical guarantee deteriorating.

Core claim

We replace the augmented-action geometry with a decoupled surrogate consisting of a softmax classifier head and an independent sigmoid head per expert. Per-sample gradients are then coordinatewise and the class-expert Hessian block is identically zero. We prove an excess-risk bound whose calibration constant equals max{2√2, √(2J/λ)} and does not grow with expert pool size J when the per-expert weight λ is held fixed.

What carries the argument

The decoupled surrogate that trains a shared softmax over classes together with a separate sigmoid per expert, making class-expert interactions vanish in the Hessian.
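A minimal sketch of what such a surrogate could look like for a single sample, assuming the loss is softmax cross-entropy on the class logits plus a (λ/J)-weighted sigmoid cross-entropy per expert. This form is inferred from the per-sample gradients quoted from Proposition 9 further down this page, not taken from the paper's own definition. The function names and the expert target vector `t` (for instance, 1 when expert j is correct on the sample) are assumptions for illustration.

```python
import numpy as np

def decoupled_loss(w, s, y, t, lam):
    """Hypothetical decoupled surrogate for one sample.

    w: class logits (K,), s: expert logits (J,), y: true class index,
    t: per-expert targets in [0, 1] (J,), lam: total expert weight.
    """
    J = s.shape[0]
    # Softmax cross-entropy on the class head: logsumexp(w) - w[y].
    class_term = np.logaddexp.reduce(w) - w[y]
    # Sigmoid cross-entropy from logits: softplus(s) - t * s, per expert.
    expert_terms = np.logaddexp(0.0, s) - t * s
    return class_term + (lam / J) * expert_terms.sum()

def decoupled_grad(w, s, y, t, lam):
    """Closed-form gradients; each coordinate depends only on its own head."""
    J = s.shape[0]
    p = np.exp(w - np.logaddexp.reduce(w))   # softmax probabilities
    u = 1.0 / (1.0 + np.exp(-s))             # per-expert sigmoid outputs
    g_w = p.copy()
    g_w[y] -= 1.0                            # p_r - 1{r = y}
    g_s = (lam / J) * (u - t)                # (lam / J) * (u_j - t_j)
    return g_w, g_s
```

Under this assumed form, the class gradient never sees the expert logits and each expert gradient sees only its own logit, which is the coordinatewise behavior the page describes.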

If this is right

The method remains stable when the expert pool is enlarged.
Rare specialists continue to receive deferrals instead of being starved.
Prediction accuracy improves over a standalone classifier on CIFAR-10, CIFAR-10H, and Covertype.
Optimization avoids the winner-take-all and set-mass pathologies of augmented-action surrogates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decoupling may extend to settings where experts arrive or leave dynamically without retraining the entire router.
The zero class-expert Hessian block could be exploited for faster second-order optimization methods in large expert pools.
If the bound is tight, real deployments with dozens of specialists would not need to increase regularization as the pool grows.

Load-bearing premise

The weight given to each individual expert stays fixed even while the total number of experts increases.

What would settle it

An empirical plot of excess risk versus expert pool size J, with λ held fixed, that exceeds the claimed calibration constant on a controlled synthetic dataset.

Original abstract

A learning-to-defer (L2D) system decides, for each input, whether to predict on its own or to hand it to one of several available experts. The very well established recipe trains classifier and router jointly by treating the $K$ classes and $J$ experts as competing actions in one shared $(K{+}J)$-action geometry. Subsequent work has proposed a series of incremental fixes within this geometry; we show that each still suffers, to varying severity, from an optimization-level pathology (target distortion, gradient amplification, winner-take-all starvation, set-mass collapse, or class-expert coupling) even under statistical consistency. We step outside the augmented-action family entirely and propose a decoupled surrogate: a softmax classifier head and an independent sigmoid head per expert, mirroring the two natural objects of the problem. We show that per-sample updates are then coordinatewise and the class-expert Hessian block is identically zero, and prove an excess-risk bound with calibration constant $\max\{2\sqrt{2},\sqrt{2J/\lambda}\}$ -- to our knowledge the first multi-expert L2D guarantee whose constant does not grow with the expert pool when the per-expert weight is held fixed. On controlled synthetic studies and on CIFAR-10, CIFAR-10H, and Covertype, it is the only method in our comparison that remains stable as the expert pool grows, preserves rare specialists, and improves over a standalone classifier on every real-data benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Decoupled surrogate sidesteps joint-optimization pathologies in multi-expert deferral, but the advertised J-independent bound actually scales with √J under fixed λ.

This paper steps outside the usual augmented-action setup for multi-expert learning-to-defer. It replaces the single shared (K+J)-action geometry with a softmax classifier head plus an independent sigmoid head for each expert. The change makes per-sample updates coordinatewise and sets the class-expert Hessian block to zero, which removes several of the listed optimization issues such as gradient amplification, winner-take-all starvation, and class-expert coupling even when the surrogate remains statistically consistent.
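The zero class-expert Hessian block can be checked numerically. The sketch below compares a decoupled loss with one illustrative augmented-action loss that places all K+J logits under a single softmax. Both loss forms, the targets `t`, and the helper `fd_hessian` are assumptions chosen for this illustration; the paper's exact surrogates and baselines may differ.

```python
import numpy as np

def fd_hessian(f, z, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at z."""
    n = z.size
    I = np.eye(n)
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            H[a, b] = (f(z + eps * I[a] + eps * I[b])
                       - f(z + eps * I[a] - eps * I[b])
                       - f(z - eps * I[a] + eps * I[b])
                       + f(z - eps * I[a] - eps * I[b])) / (4 * eps ** 2)
    return H

K, J, y, lam = 3, 2, 1, 1.0          # hypothetical sizes, label, and weight
t = np.array([1.0, 0.0])             # hypothetical expert targets

def decoupled(z):
    # Class logits and expert logits enter through separate terms.
    w, s = z[:K], z[K:]
    class_term = np.logaddexp.reduce(w) - w[y]
    expert_terms = np.logaddexp(0.0, s) - t * s
    return class_term + (lam / J) * expert_terms.sum()

def augmented(z):
    # One softmax over all K+J actions (illustrative augmented-action form).
    lse = np.logaddexp.reduce(z)
    return (lse - z[y]) + np.sum(t * (lse - z[K:]))

z0 = np.array([0.2, -0.5, 1.0, 0.3, -0.8])
cross_dec = fd_hessian(decoupled, z0)[:K, K:]   # class-expert block, decoupled
cross_aug = fd_hessian(augmented, z0)[:K, K:]   # class-expert block, shared softmax

print("decoupled cross block max |entry|:", np.abs(cross_dec).max())
print("augmented cross block max |entry|:", np.abs(cross_aug).max())
```

In the decoupled case the cross block vanishes up to floating-point noise because the loss is a sum of a function of the class logits and a function of the expert logits. In the shared-softmax case every class logit interacts with every expert logit through the common normalizer.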

Referee Report

1 major / 2 minor

Summary. The paper proposes a decoupled surrogate for multi-expert learning-to-defer that replaces the standard augmented-action geometry with a softmax classifier head and independent per-expert sigmoid heads. It identifies optimization pathologies in prior augmented-action methods, proves an excess-risk bound whose calibration constant is max{2√2, √(2J/λ)}, and claims this is the first multi-expert L2D guarantee whose constant does not grow with expert pool size J when the per-expert weight λ is held fixed. Experiments on synthetic data and real benchmarks (CIFAR-10, CIFAR-10H, Covertype) show improved stability as J grows and consistent gains over a standalone classifier.

Significance. A correctly derived J-independent excess-risk bound would be a meaningful theoretical contribution to L2D, as it would separate the surrogate design from the scaling issues that afflict augmented-action formulations. The empirical demonstration of stability with growing expert pools is a concrete strength. However, the significance is tempered by the apparent tension between the stated bound and the J-independence claim, which is the primary advertised novelty.

major comments (1)

Abstract (and the theorem stating the excess-risk bound): the calibration constant is given explicitly as max{2√2, √(2J/λ)}. Holding λ fixed while J increases causes the second term to scale as √J, so the constant grows with the expert pool. This directly contradicts the claim that the constant 'does not grow with the expert pool when the per-expert weight is held fixed.' Because the novelty of the result is advertised as resting on this J-independence property, the inconsistency is load-bearing for the central theoretical contribution.
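The arithmetic behind this comment can be tabulated directly. The sketch below evaluates the quoted constant under two readings of "per-expert weight held fixed": λ itself fixed, as the comment assumes, and β = λ/J fixed, the quantity named in the Theorem 10 excerpt quoted later on this page. Which reading the paper intends is not settled here; the values of λ and β are hypothetical.

```python
import math

def calibration_constant(J: int, lam: float) -> float:
    """max{2*sqrt(2), sqrt(2J/lam)}, the constant quoted in the report."""
    return max(2.0 * math.sqrt(2.0), math.sqrt(2.0 * J / lam))

pool_sizes = [1, 4, 16, 64]
beta = 0.25  # hypothetical per-expert weight under the reading beta = lam / J

# Reading 1: lam itself is held fixed while J grows.
fixed_lam = [calibration_constant(J, lam=1.0) for J in pool_sizes]

# Reading 2: beta = lam / J is held fixed, so lam = beta * J grows with J.
fixed_beta = [calibration_constant(J, lam=beta * J) for J in pool_sizes]

for J, a, b in zip(pool_sizes, fixed_lam, fixed_beta):
    print(f"J={J:3d}  fixed lam: {a:7.3f}   fixed beta: {b:7.3f}")
```

With these illustrative values, the first reading gives a constant that grows like √J once √(2J/λ) exceeds 2√2, while the second gives √(2/β) for every J, so the constant stays flat.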

minor comments (2)

The abstract and introduction should clarify whether λ is intended to be rescaled with J or whether a different analysis yields a strictly J-independent constant; the current wording leaves this ambiguous.
Figure captions and experimental sections should report the precise values of λ used in each run and whether they were held constant across increasing J.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the inconsistency between the stated excess-risk bound and our claim of J-independence. We address the major comment below.

Referee: Abstract (and the theorem stating the excess-risk bound): the calibration constant is given explicitly as max{2√2, √(2J/λ)}. Holding λ fixed while J increases causes the second term to scale as √J, so the constant grows with the expert pool. This directly contradicts the claim that the constant 'does not grow with the expert pool when the per-expert weight is held fixed.' Because the novelty of the result is advertised as resting on this J-independence property, the inconsistency is load-bearing for the central theoretical contribution.

Authors: We agree that the explicit form of the calibration constant, max{2√2, √(2J/λ)}, grows with √J when λ is held fixed. This directly contradicts the J-independence claim made in the abstract and introduction. The inconsistency is an error in our presentation of the result. We will revise the abstract, the theorem statement, and all related discussion to remove the incorrect claim of J-independence under fixed per-expert weight and to accurately describe the scaling behavior of the bound. The decoupled surrogate still yields practical optimization advantages and empirical stability with growing J, but the theoretical novelty must be restated without the erroneous independence claim.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of excess-risk bound

The paper presents a new decoupled surrogate for multi-expert L2D and derives an excess-risk bound whose calibration constant is explicitly stated as max{2√2, √(2J/λ)} under the fixed-λ modeling choice. No load-bearing step reduces by construction to a fitted parameter, self-citation, or redefinition of the target quantity; the bound follows from standard analysis of the proposed coordinatewise updates and zero Hessian blocks. The J-independence claim is tied directly to the stated assumption of holding λ fixed while J grows, rather than emerging tautologically from the inputs. The derivation is self-contained against external benchmarks and does not rely on the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard excess-risk assumptions for surrogate losses in classification plus the modeling decision to hold the per-expert weight fixed; no new entities are introduced.

free parameters (1)

λ (per-expert weight)
Held fixed while J grows; appears in the calibration constant √(2J/λ)

axioms (1)

domain assumption: Standard assumptions for excess-risk bounds of surrogate losses in multi-class classification
Invoked to obtain the calibration constant max{2√2, √(2J/λ)}

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Theorem 10 (H-consistency bound ... calibration constant max{2√2, √(2J/λ)} ... for fixed β=λ/J

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: Proposition 9 (Per-sample gradient ... ∂Φ^dec_λ/∂w_r = p_r − 1{r=y}, ∂Φ^dec_λ/∂s_j = (λ/J)(u_j − t_j)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 7.0

Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.