Why Ask One When You Can Ask k? Learning-to-Defer to the Top-k Experts
Pith reviewed 2026-05-22 18:59 UTC · model grok-4.3
The pith
A Top-k Learning-to-Defer framework allocates each query to the k most cost-effective experts and unifies prior single-expert methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that deferring each query to its k most cost-effective entities unifies and strictly generalizes prior Learning-to-Defer approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. A Bayes-consistent surrogate loss that does not depend on k lets a single trained policy be deployed at any chosen k.
What carries the argument
The Top-k Learning-to-Defer allocation rule that selects the k most cost-effective entities for each query, supported by a k-independent consistent surrogate loss for training.
If this is right
- Recovers the usual Top-1 deferral rule as a special case when k equals 1.
- Enables principled collaboration with multiple experts when k is greater than 1.
- A single learned policy can be deployed flexibly for any chosen k because the surrogate loss does not depend on k.
- Delivers superior accuracy-cost trade-offs in both one-stage and two-stage regimes according to the experiments.
- The adaptive Top-k(x) variant learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost.
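The deploy-at-any-k point can be illustrated with a minimal sketch. It assumes the trained policy outputs one score per entity (higher meaning more cost-effective), which is a simplification for illustration; the names `rank_entities` and `top_k` and the example scores are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def rank_entities(scores: Sequence[float]) -> List[int]:
    """Return entity indices ordered from highest to lowest score (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k best-ranked entities; k is chosen at deployment, not at training."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of entities")
    return rank_entities(scores)[:k]


# Hypothetical scores for one query over four entities
# (for instance the model itself plus three experts).
scores = [0.2, 1.5, -0.3, 0.9]

print(top_k(scores, 1), top_k(scores, 2), top_k(scores, 3))
# prints: [1] [1, 3] [1, 3, 0]
```

Because the ranking is computed once, the Top-1 rule is simply the first element, and the selected sets are nested as k grows.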
Where Pith is reading between the lines
- This could allow systems to handle varying expert availability by adjusting k dynamically without retraining.
- Similar allocation ideas might apply to other multi-agent decision systems where consulting multiple sources has cumulative costs.
- Testing on real-world datasets with varying expert costs could reveal practical limits on how large k should be before diminishing returns set in.
- Extending the consistency proofs to additional surrogate losses might broaden the framework's use in cost-sensitive applications.
Load-bearing premise
The surrogate loss stays consistent with optimal decisions no matter what value of k is used after the policy is trained.
What would settle it
An experiment on a dataset with known expert accuracies and costs where the trained policy for k=2 fails to achieve lower expected consultation cost than k=1 at the same accuracy level would show the unification does not hold.
Original abstract
Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the first framework for Top-k Learning-to-Defer, generalizing single-expert deferral to allocate each query to the k most cost-effective experts (or entities). It unifies one-stage and two-stage regimes, selective prediction, and classical cascades, recovers Top-1 as a special case, and proposes an adaptive Top-k(x) variant that learns the number of experts per query based on input difficulty, expert quality, and cost. A novel surrogate loss is developed that is claimed to be Bayes-consistent, H_h-consistent in the one-stage setting, and (H_r, H_g)-consistent in the two-stage setting; crucially, this surrogate is independent of k so that a single policy can be learned once and deployed for arbitrary k. Experiments across both regimes report superior accuracy-cost trade-offs.
Significance. If the central claims hold, the work is significant because it extends Learning-to-Defer beyond single-expert settings and supplies a unified formulation together with a practical, k-independent surrogate that enables flexible multi-expert collaboration. The unification of disparate prior regimes and the adaptive Top-k(x) policy are notable strengths; the paper also supplies reproducible experimental comparisons that demonstrate concrete accuracy-cost improvements.
major comments (2)
- [§4] §4 (surrogate loss definition and consistency statements): the claim that the surrogate is independent of k and therefore permits a single learned policy for any chosen k is load-bearing for the unification and deployment argument, yet the top-k allocation rule selects the k lowest-cost or highest-quality experts; the consistency proofs must explicitly demonstrate that neither the risk functional nor the surrogate embeds k inside the ranking or selection terms, otherwise the learned policy becomes k-specific.
- [§3.2] §3.2 (Top-k(x) adaptive policy): the adaptive variant learns the number of experts per query, but it is unclear whether the same k-independent surrogate remains (H_r, H_g)-consistent when the cardinality is itself a learned function of x; this affects whether the single-policy claim extends to the adaptive case.
minor comments (2)
- [§4] Notation for the hypothesis classes H_h, H_r, and H_g should be introduced with explicit definitions before the consistency statements are invoked.
- [Experiments] The experimental section should report error bars or standard deviations across runs and provide brief dataset statistics (size, number of experts, cost model) to support the claimed trade-off superiority.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The two major comments raise important points about the k-independence of the surrogate and its consistency properties. We address each below and have revised the manuscript to strengthen the relevant sections and proofs.
Point-by-point responses
- Referee: [§4] §4 (surrogate loss definition and consistency statements): the claim that the surrogate is independent of k and therefore permits a single learned policy for any chosen k is load-bearing for the unification and deployment argument, yet the top-k allocation rule selects the k lowest-cost or highest-quality experts; the consistency proofs must explicitly demonstrate that neither the risk functional nor the surrogate embeds k inside the ranking or selection terms, otherwise the learned policy becomes k-specific.
  Authors: We appreciate this observation. In the surrogate loss definition (Eq. 4), k does not appear; the loss is a sum over per-expert terms involving only the model's predicted scores, expert qualities, and costs. The Bayes-consistency and H-consistency proofs (Theorems 1–3) proceed by showing that any minimizer of the surrogate risk induces the same expert ranking as the Bayes-optimal policy, where ranking is determined solely by cost-adjusted quality scores. The top-k selection rule is applied only at inference time as a deterministic post-processing step and does not enter the risk or surrogate. We have added a new paragraph and a clarifying remark after Theorem 3 to make this separation explicit.
  Revision: yes
- Referee: [§3.2] §3.2 (Top-k(x) adaptive policy): the adaptive variant learns the number of experts per query, but it is unclear whether the same k-independent surrogate remains (H_r, H_g)-consistent when the cardinality is itself a learned function of x; this affects whether the single-policy claim extends to the adaptive case.
  Authors: The surrogate itself remains unchanged and k-independent for the adaptive case. The function k(x) is learned by optimizing an auxiliary objective that uses the same surrogate scores to decide cardinality; because the surrogate already encourages correct per-expert ranking, the joint optimization preserves (H_r, H_g)-consistency. We have added a short proof sketch in the revised Appendix C.2 showing that the consistency argument carries over when k(x) belongs to the hypothesis class, and we have updated Section 3.2 to reference this extension.
  Revision: partial
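To make the idea of deciding cardinality from the same scores more tangible, here is a rough sketch of one way a per-query target for k(x) could be computed on a labeled example. It is not the paper's auxiliary objective: the majority-vote aggregation, the trade-off weight `lam`, the per-entity `fees`, and the function names are all assumptions introduced for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(preds: Sequence[int]) -> int:
    """Most common prediction; ties go to the earliest (best-ranked) entry."""
    counts = Counter(preds)
    best = max(counts.values())
    for p in preds:
        if counts[p] == best:
            return p
    raise ValueError("preds must be non-empty")


def cardinality_target(
    scores: Sequence[float],  # policy scores, one per entity (higher is better)
    preds: Sequence[int],     # each entity's prediction for this query
    fees: Sequence[float],    # hypothetical consultation fee per entity
    label: int,               # ground-truth label for this query
    lam: float = 0.1,         # hypothetical weight on consultation fees
) -> int:
    """Return the k in 1..J with the lowest (error + lam * total fee) on this example."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    best_k, best_cost = 1, float("inf")
    for k in range(1, len(order) + 1):
        chosen = order[:k]
        error = float(majority_vote([preds[j] for j in chosen]) != label)
        cost = error + lam * sum(fees[j] for j in chosen)
        if cost < best_cost:  # strict inequality keeps the smaller k on ties
            best_k, best_cost = k, cost
    return best_k


scores = [0.2, 1.5, -0.3, 0.9]
fees = [0.0, 1.0, 1.0, 1.0]

# Hard query: the top-ranked entity is wrong, so more opinions are worth paying for.
hard_k = cardinality_target(scores, preds=[1, 0, 1, 1], fees=fees, label=1)
# Easy query: every entity is right, so a single consultation suffices.
easy_k = cardinality_target(scores, preds=[1, 1, 1, 1], fees=fees, label=1)
print(hard_k, easy_k)
```

Targets of this kind could then supervise a separate predictor of k from x, while the ranking scores themselves stay untouched, which is the separation the authors' reply relies on.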
Circularity Check
No circularity: surrogate independence from k and unification are by explicit generalization, not reduction to fitted inputs
full rationale
The paper defines a new Top-k deferral framework that explicitly generalizes prior one-stage, two-stage, selective prediction and cascade regimes, recovering Top-1 as the k=1 case. The novel surrogate is constructed to be Bayes-consistent and H-consistent (or (H_r, H_g)-consistent) while stated to contain no k-dependent terms in its risk functional or ranking/cost components; the single-policy deployment claim follows directly from this independence rather than from any fitted parameter or self-referential definition. No equation equates a claimed prediction to a quantity defined inside the same derivation, and no load-bearing consistency proof is shown to embed the target k inside the loss used to learn the policy. The formulation is therefore self-contained against external L2D benchmarks and prior consistency results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multiple experts exist with known or estimable per-query consultation costs and prediction qualities that can be used to rank them.
invented entities (1)
- Top-k(x) adaptive deferral policy: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Corollary 9: surrogate family $\Phi^{u}_{\mathrm{def},k}(\pi,x,z)=\sum_{j}\big(\sum_{i\neq j}\mu_i\big)\,\Phi^{u}_{01}(\pi,x,j)$, which is independent of $k$
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Lemma 10: Bayes-optimal top-k set selects k lowest expected-cost entities
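The surrogate form quoted from Corollary 9 can be made concrete with a small sketch. Several things here are assumptions rather than the paper's definitions: each $\mu_i$ is read as a nonnegative cost weight for entity $i$, the base surrogate $\Phi_{01}$ is taken to be the log-softmax (cross-entropy) loss, and the function names are hypothetical. The point of the sketch is only that k appears nowhere in the computation.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the entity scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def deferral_surrogate(scores: Sequence[float], mu: Sequence[float]) -> float:
    """Compute sum_j (sum_{i != j} mu_i) * Phi(scores, j) with Phi = -log softmax_j.

    Assumption: mu[i] is a nonnegative cost weight for entity i. No k is involved.
    """
    if len(scores) != len(mu):
        raise ValueError("scores and mu must have the same length")
    total = sum(mu)
    logp = log_softmax(scores)
    # The weight on entity j is the summed cost of all *other* entities,
    # so low-cost entities receive the largest weights.
    return sum((total - mu[j]) * (-logp[j]) for j in range(len(scores)))


# Hypothetical costs: entity 1 is free (for example, it is correct), the others cost 1.
mu = [1.0, 0.0, 1.0]
loss_good = deferral_surrogate([0.0, 3.0, 0.0], mu)  # scores favor the cheap entity
loss_bad = deferral_surrogate([3.0, 0.0, 0.0], mu)   # scores favor a costly entity
print(loss_good < loss_bad)
```

Under these assumptions the loss is lower when the scores favor the cheapest entity, so minimizing it pushes the score ranking toward the cost ranking; any top-k cut is then applied to that ranking afterwards.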
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging
  The work defines a Selective-Exclusion handoff contract for hierarchical L2D, proves nodewise Bayes rules can be incoherent, and supplies exact dynamic-programming projection and TBP+RPO that drive incoherence to near...
- Optimized Deferral for Imbalanced Settings
  MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...