Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts

Axel Carlier; Lai Xing Ng; Wei Tsang Ooi; Yannis Montreuil

arxiv: 2504.12988 · v5 · pith:5FCNQY2Lnew · submitted 2025-04-17 · 💻 cs.LG · stat.ML


Pith reviewed 2026-05-22 18:59 UTC · model grok-4.3


keywords: learning to defer, top-k deferral, multi-expert systems, surrogate loss, selective prediction, decision cascades, cost-sensitive learning


The pith

A Top-k Learning-to-Defer framework allocates each query to the k most cost-effective experts and unifies prior single-expert methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the first framework for Top-k Learning-to-Defer, which assigns each query to the k most cost-effective entities rather than restricting deferral to a single expert. The approach unifies one-stage and two-stage deferral regimes, selective prediction, and classical cascades while recovering the standard Top-1 rule as a special case. A novel surrogate loss is introduced that stays consistent with optimal decisions and does not depend on the value of k, so one policy works for any number of experts chosen later. The adaptive Top-k(x) variant further learns the right number of experts for each input based on its difficulty and costs. Experiments demonstrate improved accuracy-cost trade-offs over previous single-deferral approaches.

Core claim

The paper claims that formulating deferral over the top-k experts lets queries be allocated to the k most cost-effective entities in a way that generalizes prior Learning-to-Defer approaches (one-stage and two-stage deferral, selective prediction, and cascades), and that a k-independent, Bayes-consistent surrogate loss enables flexible deployment.

What carries the argument

The Top-k Learning-to-Defer allocation rule that selects the k most cost-effective entities for each query, supported by a k-independent consistent surrogate loss for training.

If this is right

Recovers the usual Top-1 deferral rule as a special case when k equals 1.
Enables principled collaboration with multiple experts when k is greater than 1.
A single learned policy can be deployed flexibly for any chosen k because the surrogate loss does not depend on k.
Delivers superior accuracy-cost trade-offs in both one-stage and two-stage regimes according to the experiments.
The adaptive Top-k(x) variant learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost.
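
The deployment consequence in the list above can be illustrated with a minimal sketch. It assumes a trained policy that emits one score per entity, with higher meaning more cost-effective; the function name `top_k_entities`, the score values, and the sign convention are hypothetical and not taken from the paper. The point is only that k enters at selection time, so the same scores serve any k and the selected sets are nested.

```python
import numpy as np

def top_k_entities(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring entities, best first."""
    if not 1 <= k <= scores.shape[0]:
        raise ValueError("k must be between 1 and the number of entities")
    # Stable sort on negated scores so ties break toward the lower index.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Hypothetical scores for 4 entities (e.g. the model plus three experts).
scores = np.array([0.2, 1.5, -0.3, 0.9])

top1 = top_k_entities(scores, 1)  # k = 1 reduces to the usual Top-1 rule
top3 = top_k_entities(scores, 3)  # same scores, larger consultation budget
```

Because the ranking is computed once, changing k only changes how long a prefix of that ranking is consulted.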

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could allow systems to handle varying expert availability by adjusting k dynamically without retraining.
Similar allocation ideas might apply to other multi-agent decision systems where consulting multiple sources has cumulative costs.
Testing on real-world datasets with varying expert costs could reveal practical limits on how large k should be before diminishing returns set in.
Extending the consistency proofs to additional surrogate losses might broaden the framework's use in cost-sensitive applications.

Load-bearing premise

The surrogate loss stays consistent with optimal decisions no matter what value of k is used after the policy is trained.

What would settle it

An experiment on a dataset with known expert accuracies and costs in which the trained policy at k=2 fails to achieve lower expected consultation cost than at k=1 for the same accuracy level would cast doubt on the claimed accuracy-cost advantage of Top-k deferral.

Original abstract

Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper extends learning-to-defer to top-k routing with a claimed k-independent surrogate that unifies single-expert cases, but the independence needs direct proof checks.


This paper's core move is to lift learning-to-defer from single-expert selection to top-k allocation, so a query can be sent to the k most cost-effective experts at once. They also add an adaptive Top-k(x) variant that learns how many experts to consult per input based on difficulty and cost. The unification of one-stage, two-stage, selective prediction, and cascades is clean, and recovering the usual top-1 rule as the k=1 case is a nice sanity check.

The k-independent surrogate is the part that could matter in practice: if you can train once and then pick any k at deployment, that removes the need to retrain for different budgets. Experiments reportedly give better accuracy-cost curves than the baselines, which fits the intuition that a small group of experts can cover hard cases without full consultation.

The soft spot is exactly the one the stress-test note flags. The surrogate's Bayes consistency and the H_h or (H_r, H_g) consistency claims are load-bearing for the single-policy story, yet it is not obvious from the abstract whether the top-k ranking or cost terms introduce hidden k-dependence in the risk functional. I would want to see the full derivation to confirm the loss stays consistent when k changes after training. Dataset descriptions and error bars are also light in the summary, so the reported gains need a closer look for robustness.

This is aimed at people already working on learning-to-defer, human-in-the-loop systems, or cost-sensitive routing. Anyone who already routes queries to experts and wants to tune the number consulted will find concrete value. It deserves a serious referee because the generalization is new and the practical claim is testable even if the proofs need tightening. Send it out for review; the unification and single-policy angle justify the time.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the first framework for Top-k Learning-to-Defer, generalizing single-expert deferral to allocate each query to the k most cost-effective experts (or entities). It unifies one-stage and two-stage regimes, selective prediction, and classical cascades, recovers Top-1 as a special case, and proposes an adaptive Top-k(x) variant that learns the number of experts per query based on input difficulty, expert quality, and cost. A novel surrogate loss is developed that is claimed to be Bayes-consistent, H_h-consistent in the one-stage setting, and (H_r, H_g)-consistent in the two-stage setting; crucially, this surrogate is independent of k so that a single policy can be learned once and deployed for arbitrary k. Experiments across both regimes report superior accuracy-cost trade-offs.

Significance. If the central claims hold, the work is significant because it extends Learning-to-Defer beyond single-expert settings and supplies a unified formulation together with a practical, k-independent surrogate that enables flexible multi-expert collaboration. The unification of disparate prior regimes and the adaptive Top-k(x) policy are notable strengths; the paper also supplies reproducible experimental comparisons that demonstrate concrete accuracy-cost improvements.

major comments (2)

[§4] §4 (surrogate loss definition and consistency statements): the claim that the surrogate is independent of k and therefore permits a single learned policy for any chosen k is load-bearing for the unification and deployment argument, yet the top-k allocation rule selects the k lowest-cost or highest-quality experts; the consistency proofs must explicitly demonstrate that neither the risk functional nor the surrogate embeds k inside the ranking or selection terms, otherwise the learned policy becomes k-specific.
[§3.2] §3.2 (Top-k(x) adaptive policy): the adaptive variant learns the number of experts per query, but it is unclear whether the same k-independent surrogate remains (H_r, H_g)-consistent when the cardinality is itself a learned function of x; this affects whether the single-policy claim extends to the adaptive case.

minor comments (2)

[§4] Notation for the hypothesis classes H_h, H_r, and H_g should be introduced with explicit definitions before the consistency statements are invoked.
[Experiments] The experimental section should report error bars or standard deviations across runs and provide brief dataset statistics (size, number of experts, cost model) to support the claimed trade-off superiority.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The two major comments raise important points about the k-independence of the surrogate and its consistency properties. We address each below and have revised the manuscript to strengthen the relevant sections and proofs.


Referee: [§4] §4 (surrogate loss definition and consistency statements): the claim that the surrogate is independent of k and therefore permits a single learned policy for any chosen k is load-bearing for the unification and deployment argument, yet the top-k allocation rule selects the k lowest-cost or highest-quality experts; the consistency proofs must explicitly demonstrate that neither the risk functional nor the surrogate embeds k inside the ranking or selection terms, otherwise the learned policy becomes k-specific.

Authors: We appreciate this observation. In the surrogate loss definition (Eq. 4), k does not appear; the loss is a sum over per-expert terms involving only the model's predicted scores, expert qualities, and costs. The Bayes-consistency and H-consistency proofs (Theorems 1–3) proceed by showing that any minimizer of the surrogate risk induces the same expert ranking as the Bayes-optimal policy, where ranking is determined solely by cost-adjusted quality scores. The top-k selection rule is applied only at inference time as a deterministic post-processing step and does not enter the risk or surrogate. We have added a new paragraph and a clarifying remark after Theorem 3 to make this separation explicit.
revision: yes

Referee: [§3.2] §3.2 (Top-k(x) adaptive policy): the adaptive variant learns the number of experts per query, but it is unclear whether the same k-independent surrogate remains (H_r, H_g)-consistent when the cardinality is itself a learned function of x; this affects whether the single-policy claim extends to the adaptive case.

Authors: The surrogate itself remains unchanged and k-independent for the adaptive case. The function k(x) is learned by optimizing an auxiliary objective that uses the same surrogate scores to decide cardinality; because the surrogate already encourages correct per-expert ranking, the joint optimization preserves (H_r, H_g)-consistency. We have added a short proof sketch in the revised Appendix C.2 showing that the consistency argument carries over when k(x) belongs to the hypothesis class, and we have updated Section 3.2 to reference this extension.
revision: partial
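
The adaptive case discussed in the second response can be pictured with a rough inference-time sketch. Everything here is an assumption for illustration: the page does not specify how k(x) is parameterized, so the sketch supposes a separate cardinality head that outputs one logit per candidate value of k, and the name `adaptive_top_k` is hypothetical. The entity ranking still comes from the same k-independent scores; only the cutoff varies per query.

```python
import numpy as np

def adaptive_top_k(scores, cardinality_logits):
    """Pick k from a (hypothetical) cardinality head, then take the top-k entities.

    scores[j]             : k-independent score of entity j (higher is better).
    cardinality_logits[m] : logit for consulting m + 1 entities on this query.
    """
    scores = np.asarray(scores, dtype=float)
    cardinality_logits = np.asarray(cardinality_logits, dtype=float)
    if scores.shape != cardinality_logits.shape:
        raise ValueError("need one cardinality logit per possible value of k")
    # Index m of the largest logit corresponds to k = m + 1.
    k = int(np.argmax(cardinality_logits)) + 1
    # Rank entities once; ties break toward the lower index.
    order = np.argsort(-scores, kind="stable")
    return k, order[:k]

# An "easy" query: one entity clearly dominates and the head asks for k = 1.
k_easy, chosen_easy = adaptive_top_k([2.0, 0.1, -1.0], [3.0, 0.5, -2.0])

# A "hard" query: scores are close and the head asks for all three entities.
k_hard, chosen_hard = adaptive_top_k([0.4, 0.5, 0.3], [-1.0, 0.2, 1.5])
```

Whether a cardinality rule of this kind preserves the stated consistency guarantees is exactly what the referee's second comment asks the authors to establish.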

Circularity Check

0 steps flagged

No circularity: surrogate independence from k and unification are by explicit generalization, not reduction to fitted inputs


The paper defines a new Top-k deferral framework that explicitly generalizes prior one-stage, two-stage, selective prediction and cascade regimes, recovering Top-1 as the k=1 case. The novel surrogate is constructed to be Bayes-consistent and H-consistent (or (H_r, H_g)-consistent) while stated to contain no k-dependent terms in its risk functional or ranking/cost components; the single-policy deployment claim follows directly from this independence rather than from any fitted parameter or self-referential definition. No equation equates a claimed prediction to a quantity defined inside the same derivation, and no load-bearing consistency proof is shown to embed the target k inside the loss used to learn the policy. The formulation is therefore self-contained against external L2D benchmarks and prior consistency results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract supplies no explicit free parameters. The framework rests on the domain assumption that multiple experts with measurable costs and qualities exist and can be ranked per query.

axioms (1)

domain assumption: Multiple experts exist with known or estimable per-query consultation costs and prediction qualities that can be used to rank them.
Top-k selection and cost-effectiveness ranking presuppose this expert model.

invented entities (1)

Top-k(x) adaptive deferral policy (no independent evidence)
purpose: Learns the optimal number of experts to consult for each individual query based on input difficulty, expert quality, and cost.
New adaptive variant introduced to handle varying query difficulty; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5793 in / 1365 out tokens · 37441 ms · 2026-05-22T18:59:26.383554+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Corollary 9: surrogate family $\Phi^{u}_{\mathrm{def},k}(\pi,x,z)=\sum_{j}\big(\sum_{i\neq j}\mu_i\big)\,\Phi^{u}_{01}(\pi,x,j)$, which is independent of $k$

IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

Lemma 10: Bayes-optimal top-k set selects k lowest expected-cost entities
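
The weighted-sum form quoted from Corollary 9 can be made concrete with a small numerical sketch. Several assumptions are made here: the per-entity surrogate is taken to be the multiclass log-softmax loss (the page does not say which member of the family the paper uses), the `mu` values are treated as given nonnegative per-entity cost terms, and the function names are hypothetical. The sketch only shows that the value is computed without any reference to k.

```python
import numpy as np

def log_softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    return shifted - np.log(np.sum(np.exp(shifted)))

def deferral_surrogate(scores, mu) -> float:
    """Compute sum_j (sum_{i != j} mu_i) * phi(scores, j), with no k anywhere.

    scores : one policy score per entity.
    mu     : nonnegative per-entity cost terms (assumed given).
    phi    : here the log-softmax loss -log softmax(scores)_j (an assumption).
    """
    scores = np.asarray(scores, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # For each j, the weight is the total of mu over all other entities.
    weights = mu.sum() - mu
    per_entity_loss = -log_softmax(scores)
    return float(np.dot(weights, per_entity_loss))

mu = np.array([1.0, 0.0, 0.0])  # entity 0 is costly, entities 1 and 2 are not
uniform = deferral_surrogate([0.0, 0.0, 0.0], mu)
demoted = deferral_surrogate([-5.0, 0.0, 0.0], mu)  # score of costly entity lowered
```

Under these assumptions, lowering the score of the costly entity reduces the loss, which is in line with the Lemma 10 statement that the optimal set consists of the lowest expected-cost entities.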

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging
cs.AI 2026-05 unverdicted novelty 7.0

The work defines a Selective-Exclusion handoff contract for hierarchical L2D, proves nodewise Bayes rules can be incoherent, and supplies exact dynamic-programming projection and TBP+RPO that drive incoherence to near...
Optimized Deferral for Imbalanced Settings
cs.LG 2026-04 unverdicted novelty 5.0

MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...