Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging

Alison Noble; Emma Sun; Helen Higham; Joshua Strong; Pramit Saha

arxiv: 2605.02734 · v1 · submitted 2026-05-04 · 💻 cs.AI


Pith reviewed 2026-05-08 18:44 UTC · model grok-4.3

classification 💻 cs.AI

keywords learning to defer · hierarchical multi-label · medical imaging · coherent deferral · selective-exclusion contract · Bayes-optimal deferral · taxonomic belief propagation · action incoherence


The pith

Hierarchical medical imaging deferral requires a selective-exclusion contract to prevent contradictory delegation actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that learning to defer in hierarchical multi-label settings for medical imaging produces inconsistent actions when decisions are made independently at each node. It formalizes coherent hierarchical deferral through a Selective-Exclusion handoff contract and shows that even Bayes-optimal per-node rules can contradict the taxonomy, violate delegation rules, or defer labels that the model's own assertions already imply. Two remedies follow: an exact dynamic-programming projection over coherent actions and a recursive joint model using taxonomic belief propagation with policy optimization. If the contract holds in practice, AI systems can delegate to experts within clinical taxonomies without the contradictions that independent per-node decisions allow.

Core claim

We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule, and show that even nodewise Bayes L2D can be action-incoherent. Projection removes [incoherence] exactly, and fast TBP+RPO drives incoherence near zero while retaining strong utility.

What carries the argument

The Selective-Exclusion handoff contract, which restricts each node's action by the action taken at its parent: it forbids positive child assertions under absent or deferred parents, while still letting the model assert some findings and selectively defer others.

Load-bearing premise

The selective-exclusion handoff contract must accurately reflect the delegation rules clinicians actually require in medical imaging workflows.

What would settle it

A controlled study comparing error rates and clinician override frequency when using coherent versus nodewise-incoherent deferral outputs on the same hierarchical medical cases.

Figures

Figures reproduced from arXiv: 2605.02734 by Alison Noble, Emma Sun, Helen Higham, Joshua Strong, Pramit Saha.

**Figure 1.** Illustrative subtree used throughout Section 3. The label space obeys the standard hierarchy constraint of upward implication: for every edge (p → c) ∈ ℰ, y_c = 1 ⟹ y_p = 1 (Eq. 2). Its contrapositive gives downward implication: y_p = 0 ⟹ y_c = 0. We also adopt the open-world assumption y_p = 1 ⇏ ∃ c ∈ C(p) such that y_c = 1 (Eq. 3). Thus, a positive parent need not be fully explained by the listed children. …

**Figure 2.** Representative VinDr-CXR deferral curves. Mean over 5 seeds when deferring to radiologist R9. (a) System F1 score across the node-level deferral budget. (b) Any-incoherence rate over the same budget sweep. We test three theory-derived claims. First, independent nodewise L2D should produce deferral-specific incoherence: Proposition 3 shows that even oracle nodewise Bayes decisions can yield delegation vio…

**Figure 3.** Annotation-compatible chest X-ray taxonomy for the VinDr-CXR dataset. Problematic raw-to-raw implication edges from the original taxonomy are replaced by synthetic internal grouping nodes. The raw source taxonomy contains a No finding node under ROOT, but all reported experiments exclude No finding from training and evaluation.

**Figure 4.** Taxonomy utilised for the CheXpert dataset.

**Figure 5.** Taxonomy utilised for the PadChest dataset.

**Figure 6.** Taxonomy utilised for the ADPv2 dataset.

**Figure 7.** CheXpert multiple-expert results. (A) Overall AU-SysF1-label for eight human/AI expert combinations across 15 runs per combination (5 readers × 3 seeds); points show means and bars show ±1 SD. (B) Mean label-level deferral curves comparing Human only and Human + CARZero + KAD + CheXzero; stars mark the peak of each mean curve.

Original abstract

Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomies. In this setting, deferral is a delegation action rather than a label assignment, so treating it as an independent per-label decision can produce deferral incoherence, including taxonomic contradictions, delegation violations, and deferrals of labels already implied by the model's own assertions. We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule, and show that even nodewise Bayes L2D can be action-incoherent. We then propose two remedies: exact coherent projection, a dynamic-programming decoder over the coherent action set, and Taxonomic Belief Propagation (TBP) with Recursive Policy Optimisation (RPO), a contract-aware joint action model trained through the same recursion used at inference. Across real-reader and controlled-expert medical-imaging benchmarks, naive binary-relevance L2D exhibits non-trivial incoherence. Projection removes it exactly, and fast TBP+RPO drives incoherence near zero while retaining strong utility.
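The three kinds of deferral incoherence named in the abstract can be made concrete for one parent and its immediate children. The following is a minimal sketch, not the authors' code: the function name `classify_neighbourhood` and the constants `ABSENT`, `PRESENT` and `DEFER` are hypothetical, and the rules follow the neighbourhood partition quoted in the appendix fragments extracted on this page (taxonomic contradiction: parent absent and any child present; delegation violation: parent deferred and any child present; deductive defect: parent absent, no child present, and any child deferred).

```python
from typing import Sequence, Union

Action = Union[int, str]

# Hypothetical encoding of the ternary action space: assert absent, assert present, defer.
ABSENT, PRESENT, DEFER = 0, 1, "defer"


def classify_neighbourhood(parent: Action, children: Sequence[Action]) -> str:
    """Label one parent-children neighbourhood with its incoherence type."""
    any_present = any(c == PRESENT for c in children)
    any_defer = any(c == DEFER for c in children)

    if parent == ABSENT and any_present:
        # The model asserts something impossible: child present under an absent parent.
        return "taxonomic_contradiction"
    if parent == DEFER and any_present:
        # The parent is handed to the expert, yet a child-present claim already commits it.
        return "delegation_violation"
    if parent == ABSENT and any_defer:
        # No child is present here (checked above), so the deferred child is already implied absent.
        return "deductive_defect"
    return "coherent"


examples = {
    "contradiction": classify_neighbourhood(ABSENT, [PRESENT, ABSENT]),
    "violation": classify_neighbourhood(DEFER, [PRESENT]),
    "defect": classify_neighbourhood(ABSENT, [ABSENT, DEFER]),
    "ok": classify_neighbourhood(DEFER, [ABSENT, DEFER]),
}
```

Checking the contradiction first means a neighbourhood receives exactly one label, which is what a partition of neighbourhoods requires.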

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper formalizes coherent deferral for hierarchical multi-label L2D under a Selective-Exclusion contract and gives two workable fixes, but the contract itself is an untested modeling choice.


The main point is that this is the first treatment of learning to defer when labels sit in a taxonomy rather than a flat set. Nodewise decisions can create contradictions, such as deferring a parent label while asserting a child or violating delegation rules. The authors define a Selective-Exclusion handoff contract, characterize the Bayes-optimal coherent rule, and show that even nodewise Bayes L2D can be action-incoherent under it.

Referee Report

1 major / 1 minor

Summary. The paper studies Learning to Defer (L2D) in hierarchical multi-label settings for medical imaging. It introduces a Selective-Exclusion handoff contract to define coherent deferral actions, characterizes the Bayes-optimal coherent deferral rule, shows that nodewise Bayes L2D can produce action-incoherent decisions (taxonomic contradictions, delegation violations), and proposes two remedies: an exact dynamic-programming projection decoder over the coherent action set, and Taxonomic Belief Propagation (TBP) combined with Recursive Policy Optimisation (RPO) that trains a joint action model using the same recursion. On real-reader and controlled-expert medical-imaging benchmarks, naive binary-relevance L2D shows non-trivial incoherence; projection eliminates it exactly while TBP+RPO reduces it near zero with retained utility.

Significance. If the Selective-Exclusion contract is appropriate for clinical delegation, the work supplies a clean theoretical characterisation of coherence in hierarchical L2D together with both exact and scalable algorithmic remedies whose fixed-point coherence is shown by construction. The recursive structure shared between training and inference is a technical strength. The empirical demonstration that even Bayes-optimal nodewise decisions can be incoherent supplies a concrete motivation for the framework.

major comments (1)

[Abstract / Formalisation section] The Selective-Exclusion handoff contract is load-bearing for the definition of coherence, the Bayes-optimal characterisation, the proof that nodewise L2D can be incoherent, and both proposed algorithms. The manuscript derives all results inside this contract but provides no external validation (clinician studies, workflow analysis, or sensitivity checks against alternatives such as mandatory ancestor deferral) that the contract matches real medical-imaging delegation semantics. If clinicians require different consistency rules, the incoherence metric, optimality claim, and reported utility retention become inapplicable.

minor comments (1)

[Empirical evaluation] The abstract refers to 'real-reader and controlled-expert medical-imaging benchmarks' without naming the datasets, reporting sample sizes, or describing the expert simulation protocol; the full paper should include these details plus ablations isolating the contribution of TBP versus RPO.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the foundational role of the Selective-Exclusion handoff contract and for the constructive suggestion regarding external validation. We address the comment below and outline targeted revisions to improve transparency and robustness.


Referee: [Abstract / Formalisation section] The Selective-Exclusion handoff contract is load-bearing for the definition of coherence, the Bayes-optimal characterisation, the proof that nodewise L2D can be incoherent, and both proposed algorithms. The manuscript derives all results inside this contract but provides no external validation (clinician studies, workflow analysis, or sensitivity checks against alternatives such as mandatory ancestor deferral) that the contract matches real medical-imaging delegation semantics. If clinicians require different consistency rules, the incoherence metric, optimality claim, and reported utility retention become inapplicable.

Authors: We agree that the Selective-Exclusion handoff contract is central to the framework: it defines the feasible set of coherent deferral actions, allowing a model to assert some findings while selectively deferring others without creating taxonomic contradictions or delegation violations. The contract is motivated directly by the hierarchical structure of clinical taxonomies in medical imaging, where delegating a parent node to an expert does not necessarily require deferring all of its descendants. We acknowledge that the manuscript provides no new clinician studies, workflow analyses, or explicit sensitivity checks against alternatives such as mandatory ancestor deferral. In the revision we will add a dedicated discussion subsection that: (i) elaborates the clinical motivation for Selective-Exclusion with references to related medical decision-support literature; (ii) performs a sensitivity analysis on the existing benchmarks by recomputing incoherence and utility under a mandatory-ancestor-deferral variant of the contract; and (iii) explicitly states the limitation and positions external clinician validation as an important avenue for future work. These changes make the modelling assumptions more transparent and test robustness without overstating empirical scope.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under explicit modeling contract


The paper introduces the Selective-Exclusion handoff contract as an explicit modeling assumption that defines the semantics of coherent deferral actions. Within this contract it derives the Bayes-optimal rule, demonstrates that nodewise L2D can violate coherence, and constructs two enforcement procedures (exact DP projection and TBP+RPO recursion) whose outputs are guaranteed coherent by operating inside the same action set and recursion. This is standard formal modeling rather than a circular reduction: the coherence property is not smuggled in via self-citation or fitted to the same data used for evaluation; it is the direct consequence of the chosen contract and the optimization that respects it. No equations in the abstract reduce a claimed prediction to an input quantity by construction, and the empirical benchmarks compare methods under identical contract assumptions, supplying independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The work introduces a new contract and two algorithmic constructs whose correctness depends on standard probabilistic assumptions plus the claim that the Selective-Exclusion semantics match clinical practice.

free parameters (1)

per-node deferral thresholds
Implicit in any L2D policy; the abstract does not state whether they are tuned on held-out data or fixed by the contract.

axioms (2)

domain assumption: The label taxonomy is a fixed, known DAG supplied at training time.
Invoked when defining coherent action sets and when running belief propagation over the hierarchy.
domain assumption: Expert labels are available for the subset of cases the model defers.
Required for the utility evaluation of any L2D system.

invented entities (2)

Selective-Exclusion handoff contract · no independent evidence
purpose: Defines which combinations of model assertions and deferrals are considered coherent.
Newly introduced; no independent evidence outside the paper is supplied.
Taxonomic Belief Propagation (TBP) with Recursive Policy Optimisation (RPO) · no independent evidence
purpose: Joint action model whose recursion enforces coherence at both training and inference.
New algorithmic construct; correctness argued by construction but not machine-checked.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation · washburn_uniqueness_aczel (J(x) = ½(x + x⁻¹) − 1) · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule... ρ_v(0|x)=w_v P(Y_v=1|x), ρ_v(1|x)=w_v P(Y_v=0|x), ρ_v(⊥|x)=w_v{P(M_v≠Y_v|x)+λ_v}.
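
The quoted pointwise risks can be turned into a small numerical illustration of why independent per-node decisions may be incoherent. This is a hedged sketch under stated assumptions: the function names are hypothetical, `p_pos` stands for P(Y_v=1|x), `expert_err` for P(M_v≠Y_v|x), `lam` for λ_v and `w` for w_v, and every probability below is invented for illustration rather than taken from the paper.

```python
from typing import Dict, Union

Action = Union[int, str]


def nodewise_risks(p_pos: float, expert_err: float, lam: float, w: float = 1.0) -> Dict[Action, float]:
    """Pointwise risk of each action at one node, following the quoted rho_v terms."""
    return {
        0: w * p_pos,                    # asserting absent is wrong when Y_v = 1
        1: w * (1.0 - p_pos),            # asserting present is wrong when Y_v = 0
        "defer": w * (expert_err + lam),  # expert error plus the deferral cost
    }


def nodewise_bayes_action(p_pos: float, expert_err: float, lam: float, w: float = 1.0) -> Action:
    """Independent per-node rule: pick the action with the smallest risk."""
    risks = nodewise_risks(p_pos, expert_err, lam, w)
    return min(risks, key=risks.__getitem__)


# Illustrative numbers only. Upward implication requires P(parent = 1) >= P(child = 1).
parent_action = nodewise_bayes_action(p_pos=0.96, expert_err=0.01, lam=0.01)
child_action = nodewise_bayes_action(p_pos=0.95, expert_err=0.20, lam=0.01)
```

With these values the parent is deferred because a very accurate expert makes deferral cheaper than asserting present, while the child is asserted present because the expert is weaker there. The pair (parent deferred, child present) is the delegation-violation pattern, even though each node minimised its own risk.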

Foundation.Atomicity (topoSort over tree precedence) · topoSort_respects · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

On a tree, (6) is solved by dynamic programming... F_v(i) = max_{j∈Γ_SE(i)} [log η_v(j|x) + Σ_{u∈C(v)} F_u(j)]. A single decode costs O(|V||A|^2), effectively linear since |A|=3.
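
The recursion quoted above can be sketched as a bottom-up pass followed by a top-down backtrack. This is a minimal illustration rather than the authors' decoder, and it rests on three assumptions: the table `GAMMA_SE` is my reading of Γ_SE taken from the appendix fragments on this page (an absent parent forces an absent child; a deferred parent forbids a present child; a present parent is taken to leave the child unconstrained); the root has no parent and is treated as unconstrained; and `log_eta[v][j]` is a given per-node action score log η_v(j|x). The function name and the example scores are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Union

Action = Union[int, str]
ACTIONS = (0, 1, "defer")  # assert absent, assert present, defer to the expert

# Assumed Gamma_SE(i): actions allowed at a child when its parent takes action i.
GAMMA_SE = {0: (0,), 1: (0, 1, "defer"), "defer": (0, "defer")}


def coherent_decode(
    children: Dict[Hashable, List[Hashable]],
    log_eta: Dict[Hashable, Dict[Action, float]],
    root: Hashable,
) -> Dict[Hashable, Action]:
    """Highest-scoring coherent action assignment on a tree."""
    # Pre-order listing; reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    F: Dict[Hashable, Dict[Action, float]] = {}     # F[v][i]: best subtree score given parent action i
    best: Dict[Hashable, Dict[Action, Action]] = {}  # best[v][i]: maximising action j at v
    for v in reversed(order):
        # Score of taking action j at v, including the best completions of its subtrees.
        score = {j: log_eta[v][j] + sum(F[u][j] for u in children.get(v, [])) for j in ACTIONS}
        F[v], best[v] = {}, {}
        for i in ACTIONS:
            j_star = max(GAMMA_SE[i], key=score.__getitem__)
            F[v][i], best[v][i] = score[j_star], j_star

    # Assumption: an unconstrained root behaves like a node whose parent is present.
    actions = {root: best[root][1]}
    for v in order:  # parents precede children in pre-order
        for u in children.get(v, []):
            actions[u] = best[u][actions[v]]
    return actions


# Illustrative scores only; taking the per-node argmax would give (defer, present).
children = {"Lung Opacity": ["Edema"]}
log_eta = {
    "Lung Opacity": {0: math.log(0.05), 1: math.log(0.35), "defer": math.log(0.60)},
    "Edema": {0: math.log(0.10), 1: math.log(0.80), "defer": math.log(0.10)},
}
decoded = coherent_decode(children, log_eta, "Lung Opacity")
```

Each node evaluates at most |A| actions for each of the |A| parent actions, which matches the quoted O(|V||A|^2) cost. In the example the decoder prefers asserting both labels present over the incoherent per-node choice.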

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 8 canonical work pages · 1 internal anchor

[1] Bibb Allen, Sheela Agarwal, Laura Coombs, Christoph Wald, and Keith Dreyer. 2021. 2020 ACR Data Science Institute artificial intelligence survey. Journal of the American College of Radiology 18, 8 (2021), 1153–1159.
[2] Jean Vieira Alves, Diogo Leitão, Sérgio Jesus, Marco O. P. Sampaio, Javier Liébana, Pedro Saleiro, Mario A. T. Figueiredo, and Pedro Bizarro. 2024. Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=TAvGZm2Rqb
[3] Christoph D. Becker, Elmar Kotter, Laure Fournier, and Luis Martí-Bonmatí. 2022. Current practical experience with artificial intelligence in clinical radiology: a survey of the European Society of Radiology. Insights into Imaging 13, 1 (June 2022). doi:10.1186/s13244-022-01247-y
[4] L Bos and K Donnelly. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud Health Technol Inform 121 (2006), 279–290.
[5] Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. 2020. PadChest: A large chest x-ray image dataset with multi-label annotated reports. Medical Image Analysis 66 (2020), 101797. doi:10.1016/j.media.2020.101797
[6] Haomin Chen, Shun Miao, Daguang Xu, Gregory D Hager, and Adam P Harrison. 2019. Deep hierarchical multi-label classification of chest X-ray images. In International conference on medical imaging with deep learning. PMLR, 109–120.
[7] C. Chow. 1970. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16, 1 (1970), 41–46. doi:10.1109/TIT.1970.1054406
[8] C. K. Chow. 1957. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers EC-6, 4 (1957), 247–254. doi:10.1109/TEC.1957.5222035
[9] Ran El-Yaniv et al. 2010. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research 11, 5 (2010).
[10, 11] Fatma A Eltawil, Michael Atalla, Emily Boulos, Afsaneh Amirabadi, and Pascal N Tyrrell. 2023. Analyzing barriers and enablers for the acceptance of artificial intelligence innovations into radiology practice: a scoping review. Tomography 9, 4 (2023), 1443–1455.
[12] Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. Advances in neural information processing systems 30 (2017).
[13] Yonatan Geifman and Ran El-Yaniv. 2019. Selectivenet: A deep neural network with an integrated reject option. In International conference on machine learning. PMLR, 2151–2159.
[14] Eleonora Giunchiglia and Thomas Lukasiewicz. 2020. Coherent hierarchical multi-label classification networks. Advances in neural information processing systems 33 (2020), 9662–9673.
[15] Shani Goren, Ido Galil, and Ran El-Yaniv. 2024. Hierarchical selective classification. Advances in Neural Information Processing Systems 37 (2024), 111047–111073.
[16] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597.
[17] Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, and S Kevin Zhou. 2024. Carzero: Cross-attention alignment for radiology zero-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11137–11146.
[18] Curtis P Langlotz. 2006. RadLex: a new method for indexing online educational materials. Radiographics 26, 6 (Nov. 2006), 1595–1597.
[19] David Madras, Toni Pitassi, and Richard Zemel. 2018. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in neural information processing systems 31 (2018).
[20] Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. 2023. Two-stage learning to defer with multiple experts. Advances in neural information processing systems 36 (2023), 3578–3606.
[21] Yannis Montreuil, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. 2025. Why Ask One When You Can Ask 𝑘? Learning-to-Defer to the Top-𝑘 Experts. arXiv:2504.12988 [cs.LG] https://arxiv.org/abs/2504.12988
[22] Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. 2024. Two-stage Learning-to-Defer for Multi-Task Learning. arXiv preprint arXiv:2410.15729 (2024).
[23] Hussein Mozannar, Hunter Lang, Dennis Wei, Prasanna Sattigeri, Subhro Das, and David Sontag. 2023. Who should predict? exact algorithms for learning to defer to humans. In International conference on artificial intelligence and statistics. PMLR, 10520–10545.
[24] Hussein Mozannar and David Sontag. 2020. Consistent Estimators for Learning to Defer to an Expert. In International Conference on Machine Learning. PMLR, 7076–7087.
[25] Harikrishna Narasimhan, Wittawat Jitkrittum, Aditya K Menon, Ankit Rawat, and Sanjiv Kumar. 2022. Post-hoc estimators for learning to defer to an expert. Advances in Neural Information Processing Systems 35 (2022), 29292–29304.
[26] Cuong C Nguyen, Thanh-Toan Do, and Gustavo Carneiro. 2025. Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution. In The Thirteenth International Conference on Learning Representations.
[27] Ha Quy Nguyen, Hieu Huy Pham, le tuan linh, Minh Dao, and lam khanh. 2021. VinDr-CXR: An open dataset of chest X-rays with radiologist annotations. PhysioNet (June 2021). doi:10.13026/3akn-b287 Version 1.0.0
[28] Andrew Ponomarev. 2024. A Simple Heuristic for Controlling Human Workload in Learning to Defer. In International Conference on Pattern Recognition. Springer, 120–130.
[29] Joshua Strong, Emma Sun, Harry Rogers, Helen Higham, and Alison Noble. 2025. Learning to defer: A survey. (Dec. 2025).
[30, 31] Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. 2022. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature biomedical engineering 6, 12 (2022), 1399–1406.
[32] Rajeev Verma, Daniel Barrejón, and Eric Nalisnick. 2023. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics. PMLR, 11415–11434.
[33] Zhiyuan Yang, Kai Li, Sophia Ghamoshi Ramandi, Patricia Brassard, Hakim Khellaf, Vincent Quoc-Huy Trinh, Jennifer Zhang, Lina Chen, Corwyn Rowsell, Sonal Varma, et al. 2025. Adpv2: A hierarchical histological tissue type-annotated dataset for potential biomarker discovery of colorectal disease. Journal of Pathology Informatics (2025), 100537.
[34] Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. 2018. Binary relevance for multi-label learning: an overview. Frontiers of Computer Science 12, 2 (2018), 191–202.
[35] Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2023. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications 14, 1 (2023), 4542.

Contents of Appendix: 7 Extended Literature Review · 8 An Informal Walk-Through of Deferral Coherence · 9 Alternative Deferral Semantics: Strong Subtree Handoff · 9.1 ...
[36] Could the model's assertions all be true together? If not, we have a taxonomic contradiction.
[37] Is the model deferring a parent while still making a child-present claim that already commits the parent? If yes, we have a delegation violation.
[38] the expert should decide the parent · Is the model deferring a node whose value is already implied by its own earlier assertions? If yes, we have a deductive defect. Taxonomic contradiction: saying something impossible. The familiar bad case is parent absent, child present. For example, Lung Opacity = absent and Edema = present cannot both be true. Another way to say this is that, after leaving deferred nodes und...
[39] do not assert an impossible parent–child combination
[40] do not defer a question while answering it indirectly through a descendant
[41] do not defer a label that your own assertions have already determined. Section 3 makes these rules precise, and Section 4 then builds estimators whose predictions respect this coherent decision space. 9 Alternative Deferral Semantics: Strong Subtree Handoff. Our main method adopts the Selective-Exclusion contract (Definition 3), which allows selectively excl...
[42] If the parent is asserted absent, then the child is forced absent: P(a_v = 1 | a_pa(v) = 0, x) = P(a_v = ⊥ | a_pa(v) = 0, x) = 0.
[43] If the parent is deferred, the child cannot be asserted present: P(a_v = 1 | a_pa(v) = ⊥, x) = 0. Proof. Both claims follow immediately from the corresponding rows of the transition matrix (18). The first row is [1, 0, 0], and the third row is [α_v(x), 0, 1 − α_v(x)]. Proposition 5 (Coherent handoff preserves label consistency). Let the completed system label be ŷ^sys_v(a, ...
[44] Contract-induced non-separability (gating). Under a deferral contract (Selective-Exclusion), internal-node actions change the feasible action space of descendants. Therefore, the Bayes-optimal coherent policy is generally not separable across nodes, whereas many Stage I objectives are locally trained (e.g., BR-style L2D, per-node surrogates, or masked classification losses).
[45] predict vs. defer · Composition/semantics mismatch. Stage I may optimise predictions under a hierarchy semantics that differs from TBP's contract-constrained ternary composition. Even when Stage I is hierarchy-aware (e.g., constraint/closure-based HMLC surrogates such as MCLoss), the resulting parameters need not be stationary for the TBP-induced marginals that are used at infe...
[46] taxonomic contradiction: parent absent and any child present
[47] delegation violation: parent defer and any child present
[48] deductive defect: parent absent, no child present, and any child defer
[49] coherent: all remaining neighbourhoods. Finally, AU-Neigh-Any = ∫₀¹ R_any(b) db, and similarly for the defect-specific neighbourhood-partition rates used throughout the paper. Edge-weighted incoherence. In addition to the neighbourhood partition, we also compute an edge-weighted view in which the unit of analysis is the immediate parent–child edge. Let ℰ_kept...
[50] deferral coherence is still defined over ternary handoff actions rather than binary labels
[51] Selective-Exclusion still forbids positive child assertions under absent or deferred parents
[52] coherent projection and coherent-support joint action models can still be formulated exactly. What changes is the inference problem. Trees permit simple linear-time dynamic programs. DAGs require either exact bounded-treewidth inference, integer-programming decoders, or approximate structured inference. Thus, extending the present framework to realistic D...

[1] [1]

Bibb Allen, Sheela Agarwal, Laura Coombs, Christoph Wald, and Keith Dreyer. 2021. 2020 ACR Data Science Institute artificial intelligence survey.Journal of the American College of Radiology18, 8 (2021), 1153–1159

2021

[2] [2]

Jean Vieira Alves, Diogo Leitão, Sérgio Jesus, Marco O. P. Sampaio, Javier Liébana, Pedro Saleiro, Mario A. T. Figueiredo, and Pedro Bizarro. 2024. Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints.Transactions on Machine Learning Research (2024).https://openreview.net/forum?id=TAvGZm2Rqb

2024

[3] [3]

Becker, Elmar Kotter, Laure Fournier, and Luis Martí-Bonmatí

Christoph D. Becker, Elmar Kotter, Laure Fournier, and Luis Martí-Bonmatí. 2022. Current practical experience with artificial intelligence in clinical radiology: a survey of the European Society of Radiology.Insights into Imaging13, 1 (June 2022). doi: 10.1186/ s13244-022-01247-y

2022

[4] [4]

L Bos and K Donnelly. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth.Stud Health Technol Inform121 (2006), 279–290

2006

[5] [5]

Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. 2020. PadChest: A large chest x-ray image dataset with multi-label annotated reports.Medical Image Analysis66 (2020), 101797. doi:10.1016/j.media.2020.101797

work page doi:10.1016/j.media.2020.101797 2020

[6] [6]

Haomin Chen, Shun Miao, Daguang Xu, Gregory D Hager, and Adam P Harrison. 2019. Deep hierarchical multi-label classification of chest X-ray images. InInternational conference on medical imaging with deep learning. PMLR, 109–120

2019

[7] [7]

C. Chow. 1970. On optimum recognition error and reject tradeoff.IEEE Transactions on Information Theory16, 1 (1970), 41–46. doi:10.1109/TIT.1970.1054406

work page doi:10.1109/tit.1970.1054406 1970

[8] [8]

C. K. Chow. 1957. An optimum character recognition system using decision functions.IRE TransactionsonElectronicComputersEC-6,4(1957),247–254.doi: 10.1109/TEC.1957.5222035

work page doi:10.1109/tec.1957.5222035 1957

[9] [9]

On the Foundations of Noise-free Selective Classification.Journal of Machine Learning Research11, 5 (2010)

Ran El-Yaniv et al.2010. On the Foundations of Noise-free Selective Classification.Journal of Machine Learning Research11, 5 (2010)

2010

[10] [10]

Fatma A Eltawil, Michael Atalla, Emily Boulos, Afsaneh Amirabadi, and Pascal N Tyrrell

[11] [11]

Analyzingbarriersandenablersfortheacceptanceofartificialintelligenceinnovations into radiology practice: a scoping review.Tomography9, 4 (2023), 1443–1455

2023

[12] [12]

Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. Advances in neural information processing systems30 (2017)

2017

[13] [13]

Yonatan Geifman and Ran El-Yaniv. 2019. Selectivenet: A deep neural network with an integrated reject option. InInternational conference on machine learning. PMLR, 2151–2159

2019

[14] [14]

Eleonora Giunchiglia and Thomas Lukasiewicz. 2020. Coherent hierarchical multi-label classificationnetworks.Advancesinneuralinformationprocessingsystems33(2020),9662–9673. 12

2020

[15] [15]

Hierarchicalselectiveclassification.Advances in Neural Information Processing Systems37 (2024), 111047–111073

ShaniGoren,IdoGalil,andRanEl-Yaniv.2024. Hierarchicalselectiveclassification.Advances in Neural Information Processing Systems37 (2024), 111047–111073

2024

[16]

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597

[17]

Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, and S Kevin Zhou. 2024. Carzero: Cross-attention alignment for radiology zero-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11137–11146

[18]

Curtis P Langlotz. 2006. RadLex: a new method for indexing online educational materials. Radiographics 26, 6 (Nov. 2006), 1595–1597

[19]

David Madras, Toni Pitassi, and Richard Zemel. 2018. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in neural information processing systems 31 (2018)

[20]

Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. 2023. Two-stage learning to defer with multiple experts. Advances in neural information processing systems 36 (2023), 3578–3606

[21]

Yannis Montreuil, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. 2025. Why Ask One When You Can Ask 𝑘? Learning-to-Defer to the Top-𝑘 Experts. arXiv:2504.12988 [cs.LG] https://arxiv.org/abs/2504.12988

[22]

Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. 2024. Two-stage Learning-to-Defer for Multi-Task Learning. arXiv preprint arXiv:2410.15729 (2024)

[23]

Hussein Mozannar, Hunter Lang, Dennis Wei, Prasanna Sattigeri, Subhro Das, and David Sontag. 2023. Who should predict? exact algorithms for learning to defer to humans. In International conference on artificial intelligence and statistics. PMLR, 10520–10545

[24]

Hussein Mozannar and David Sontag. 2020. Consistent Estimators for Learning to Defer to an Expert. In International Conference on Machine Learning. PMLR, 7076–7087

[25]

Harikrishna Narasimhan, Wittawat Jitkrittum, Aditya K Menon, Ankit Rawat, and Sanjiv Kumar. 2022. Post-hoc estimators for learning to defer to an expert. Advances in Neural Information Processing Systems 35 (2022), 29292–29304

[26]

Cuong C Nguyen, Thanh-Toan Do, and Gustavo Carneiro. 2025. Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution. In The Thirteenth International Conference on Learning Representations

[27]

Ha Quy Nguyen, Hieu Huy Pham, le tuan linh, Minh Dao, and lam khanh. 2021. VinDr-CXR: An open dataset of chest X-rays with radiologist annotations. PhysioNet (June 2021). doi:10.13026/3akn-b287 Version 1.0.0

[28]

Andrew Ponomarev. 2024. A Simple Heuristic for Controlling Human Workload in Learning to Defer. In International Conference on Pattern Recognition. Springer, 120–130

[29]

Joshua Strong, Emma Sun, Harry Rogers, Helen Higham, and Alison Noble. 2025. Learning to defer: A survey. (Dec. 2025)

[30]

Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. 2022. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature biomedical engineering 6, 12 (2022), 1399–1406

[32]

Rajeev Verma, Daniel Barrejón, and Eric Nalisnick. 2023. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics. PMLR, 11415–11434

[33]

Zhiyuan Yang, Kai Li, Sophia Ghamoshi Ramandi, Patricia Brassard, Hakim Khellaf, Vincent Quoc-Huy Trinh, Jennifer Zhang, Lina Chen, Corwyn Rowsell, Sonal Varma, et al. 2025. Adpv2: A hierarchical histological tissue type-annotated dataset for potential biomarker discovery of colorectal disease. Journal of Pathology Informatics (2025), 100537

[34]

Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. 2018. Binary relevance for multi-label learning: an overview. Frontiers of Computer Science 12, 2 (2018), 191–202

[35]

Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2023. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications 14, 1 (2023), 4542

Contents of Appendix

7 Extended Literature Review
8 An Informal Walk-Through of Deferral Coherence
9 Alternative Deferral Semantics: Strong Subtree Handoff
9.1 ...

[36]

Could the model’s assertions all be true together? If not, we have a taxonomic contradiction

[37]

Is the model deferring a parent while still making a child-present claim that already commits the parent? If yes, we have a delegation violation

[38]

the expert should decide the parent

Is the model deferring a node whose value is already implied by its own earlier assertions? If yes, we have a deductive defect.

Taxonomic contradiction: saying something impossible. The familiar bad case is parent absent, child present. For example, Lung Opacity = absent and Edema = present cannot both be true. Another way to say this is that, after leaving deferred nodes und...

[39]

do not assert an impossible parent–child combination

[40]

do not defer a question while answering it indirectly through a descendant

[41]

do not defer a label that your own assertions have already determined.

Section 3 makes these rules precise, and Section 4 then builds estimators whose predictions respect this coherent decision space.

9 Alternative Deferral Semantics: Strong Subtree Handoff

Our main method adopts the Selective-Exclusion contract (Definition 3), which allows selectively excl...

[42]

If the parent is asserted absent, then the child is forced absent: $P(a_v = 1 \mid a_{pa(v)} = 0, x) = P(a_v = \bot \mid a_{pa(v)} = 0, x) = 0$

[43]

If the parent is deferred, the child cannot be asserted present: $P(a_v = 1 \mid a_{pa(v)} = \bot, x) = 0$.

Proof. Both claims follow immediately from the corresponding rows of the transition matrix (18). The first row is $[1, 0, 0]$, and the third row is $[\alpha_v(x), 0, 1 - \alpha_v(x)]$.

Proposition 5 (Coherent handoff preserves label consistency). Let the completed system label be ˆ𝑦 sys 𝑣 (𝑎, ...
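The two rows quoted in the proof can be checked numerically. The following is a minimal sketch, assuming the rows and columns of the transition matrix are both ordered (absent, present, deferred), which is the ordering the quoted first and third rows suggest. The excerpt does not state the parent-present row, so `present_row` is a hypothetical free distribution supplied by the caller, and the function name `transition_matrix` is illustrative only.

```python
import numpy as np

# Assumed action encoding: rows index the parent action, columns the child action.
ABSENT, PRESENT, DEFER = 0, 1, 2


def transition_matrix(alpha_v, present_row):
    """Build a 3x3 row-stochastic matrix T[parent_action, child_action].

    Row ABSENT is [1, 0, 0] and row DEFER is [alpha_v, 0, 1 - alpha_v],
    as quoted in the proof. Row PRESENT is a hypothetical input, because
    the excerpt does not specify it.
    """
    present_row = np.asarray(present_row, dtype=float)
    if not 0.0 <= alpha_v <= 1.0:
        raise ValueError("alpha_v must lie in [0, 1]")
    if present_row.shape != (3,) or (present_row < 0).any():
        raise ValueError("present_row must be three non-negative numbers")
    if not np.isclose(present_row.sum(), 1.0):
        raise ValueError("present_row must sum to 1")
    return np.array([
        [1.0, 0.0, 0.0],                # parent absent: child forced absent
        present_row,                    # parent present: assumed unconstrained
        [alpha_v, 0.0, 1.0 - alpha_v],  # parent deferred: child never present
    ])


T = transition_matrix(alpha_v=0.3, present_row=[0.2, 0.5, 0.3])
```

Under these assumptions, the zero entries `T[ABSENT, PRESENT]`, `T[ABSENT, DEFER]`, and `T[DEFER, PRESENT]` are exactly the two claims above.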

[44]

Contract-induced non-separability (gating). Under a deferral contract (Selective-Exclusion), internal-node actions change the feasible action space of descendants. Therefore, the Bayes-optimal coherent policy is generally not separable across nodes, whereas many Stage I objectives are locally trained (e.g., BR-style L2D, per-node surrogates, or masked classification losses)

[45]

predict vs. defer

Composition/semantics mismatch. Stage I may optimise predictions under a hierarchy semantics that differs from TBP’s contract-constrained ternary composition. Even when Stage I is hierarchy-aware (e.g., constraint/closure-based HMLC surrogates such as MCLoss), the resulting parameters need not be stationary for the TBP-induced marginals that are used at infe...

[46]

taxonomic contradiction: parent absent and any child present

[47]

delegation violation: parent defer and any child present

[48]

deductive defect: parent absent, no child present, and any child defer

[49]

coherent: all remaining neighbourhoods

Finally, $\text{AU-Neigh-Any} = \int_0^1 R_{\text{any}}(b)\, db$, and similarly for the defect-specific neighbourhood-partition rates used throughout the paper.

Edge-weighted incoherence. In addition to the neighbourhood partition, we also compute an edge-weighted view in which the unit of analysis is the immediate parent–child edge. Let ℰkept...
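The neighbourhood partition above can be written as a short classifier. This is an illustrative sketch and not the paper's evaluation code: the names `classify_neighbourhood` and `area_under_rate` are hypothetical, the rules are checked in the order they are listed, and the integral over $b$ is approximated by the trapezoidal rule on a grid in $[0, 1]$, which is an assumption about how the area would be computed.

```python
from typing import Sequence

ABSENT, PRESENT, DEFER = "absent", "present", "defer"


def classify_neighbourhood(parent: str, children: Sequence[str]) -> str:
    """Assign one parent-plus-children neighbourhood to a partition cell."""
    any_present = any(c == PRESENT for c in children)
    any_defer = any(c == DEFER for c in children)
    if parent == ABSENT and any_present:
        return "taxonomic contradiction"
    if parent == DEFER and any_present:
        return "delegation violation"
    if parent == ABSENT and not any_present and any_defer:
        return "deductive defect"
    return "coherent"


def area_under_rate(b: Sequence[float], rate: Sequence[float]) -> float:
    """Trapezoidal approximation of the integral of rate(b) over the grid b."""
    return sum(
        (b[i + 1] - b[i]) * (rate[i] + rate[i + 1]) / 2.0
        for i in range(len(b) - 1)
    )


# Hypothetical example neighbourhood: Lung Opacity deferred, Edema asserted present.
label = classify_neighbourhood(DEFER, [PRESENT, ABSENT])
```

Here `label` is a delegation violation, because a present child already commits the deferred parent.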

[50]

deferral coherence is still defined over ternary handoff actions rather than binary labels

[51]

Selective-Exclusion still forbids positive child assertions under absent or deferred parents

[52]

coherent projection and coherent-support joint action models can still be formulated exactly. What changes is the inference problem. Trees permit simple linear-time dynamic programs. DAGs require exact bounded-treewidth inference, integer-programming decoders, or approximate structured inference. Thus, extending the present framework to realistic D...
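To make the linear-time claim for trees concrete, the sketch below runs a bottom-up dynamic program that picks the highest-scoring coherent action assignment. It is a sketch under stated assumptions, not the paper's projection algorithm: the per-node scores are hypothetical, the objective is assumed to be a sum of per-node action scores, and coherence is assumed to reduce to the pairwise parent–child constraints implied by the rules above (absent parent forces absent child; deferred parent forbids a present child). The names `ALLOWED` and `project_tree` are illustrative.

```python
ABSENT, PRESENT, DEFER = 0, 1, 2
ACTIONS = (ABSENT, PRESENT, DEFER)

# Assumed pairwise feasibility: child actions permitted for each parent action.
ALLOWED = {
    ABSENT: (ABSENT,),
    PRESENT: (ABSENT, PRESENT, DEFER),
    DEFER: (ABSENT, DEFER),
}


def project_tree(children, scores, root):
    """Return (best_total, actions) over coherent assignments on a tree.

    children: dict mapping a node to its list of child nodes.
    scores:   dict mapping a node to [score_absent, score_present, score_defer].
    """
    # Pre-order traversal: every node appears after its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    best = {}    # best[v][a]: best subtree score when node v takes action a
    choice = {}  # choice[(c, a)]: best action for child c given parent action a
    for v in reversed(order):  # children are processed before parents
        best[v] = list(scores[v])
        for c in children.get(v, []):
            for a in ACTIONS:
                a_c = max(ALLOWED[a], key=lambda k: best[c][k])
                choice[(c, a)] = a_c
                best[v][a] += best[c][a_c]

    root_action = max(ACTIONS, key=lambda k: best[root][k])
    actions = {root: root_action}
    for v in order:  # back-track from parents to children
        for c in children.get(v, []):
            actions[c] = choice[(c, actions[v])]
    return best[root][root_action], actions


# Hypothetical scores whose nodewise argmax is incoherent
# (parent absent, child present).
tree = {"Lung Opacity": ["Edema"]}
node_scores = {"Lung Opacity": [2.0, 0.0, 0.0], "Edema": [0.0, 3.0, 1.0]}
total, actions = project_tree(tree, node_scores, "Lung Opacity")
```

Each node is visited a constant number of times and each edge does constant work, so the cost is linear in the number of nodes. In the example, the nodewise choice (parent absent, child present) is infeasible, and the program instead selects present for both nodes with total score 3.0.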
