Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging
Pith reviewed 2026-05-08 18:44 UTC · model grok-4.3
The pith
Hierarchical medical imaging deferral requires a selective-exclusion contract to prevent contradictory delegation actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule, and show that even nodewise Bayes L2D can be action-incoherent. Projection removes this incoherence exactly, and fast TBP+RPO drives it to near zero while retaining strong utility.
What carries the argument
The Selective-Exclusion handoff contract, which lets the model assert some findings while deferring others, but forbids asserting a child as present under an absent or deferred parent and forbids deferring a label that the model's own assertions already determine.
Load-bearing premise
The selective-exclusion handoff contract must accurately reflect the delegation rules clinicians actually require in medical imaging workflows.
What would settle it
A controlled study comparing error rates and clinician override frequency when using coherent versus nodewise-incoherent deferral outputs on the same hierarchical medical cases.
Original abstract
Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomies. In this setting, deferral is a delegation action rather than a label assignment, so treating it as an independent per-label decision can produce deferral incoherence, including taxonomic contradictions, delegation violations, and deferrals of labels already implied by the model's own assertions. We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule, and show that even nodewise Bayes L2D can be action-incoherent. We then propose two remedies: exact coherent projection, a dynamic-programming decoder over the coherent action set, and Taxonomic Belief Propagation (TBP) with Recursive Policy Optimisation (RPO), a contract-aware joint action model trained through the same recursion used at inference. Across real-reader and controlled-expert medical-imaging benchmarks, naive binary-relevance L2D exhibits non-trivial incoherence. Projection removes it exactly, and fast TBP+RPO drives incoherence near zero while retaining strong utility.
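To make the claim that nodewise Bayes L2D can be action-incoherent concrete, the following minimal Python sketch applies the pointwise risks quoted further down this page (ρ_v(0|x), ρ_v(1|x), ρ_v(⊥|x)) independently at a parent and at a child. The function names, the integer action encoding, and every number are illustrative assumptions, not values or code from the paper.

```python
ABSENT, PRESENT, DEFER = 0, 1, 2  # DEFER plays the role of the action written ⊥


def pointwise_risks(p_present, p_expert_wrong, lam, w=1.0):
    """Risks of the three actions at one node, following the quoted formulas.

    p_present      : P(Y_v = 1 | x)
    p_expert_wrong : P(M_v != Y_v | x)
    lam            : deferral cost lambda_v
    w              : node weight w_v
    """
    return {
        ABSENT: w * p_present,            # asserting absent is wrong when Y_v = 1
        PRESENT: w * (1.0 - p_present),   # asserting present is wrong when Y_v = 0
        DEFER: w * (p_expert_wrong + lam),
    }


def nodewise_bayes_action(p_present, p_expert_wrong, lam, w=1.0):
    """Pick the risk-minimising action at one node, ignoring the hierarchy."""
    risks = pointwise_risks(p_present, p_expert_wrong, lam, w)
    return min(risks, key=risks.get)


# Hypothetical parent: the expert is reliable, so deferring is cheapest.
parent = nodewise_bayes_action(p_present=0.60, p_expert_wrong=0.05, lam=0.10)
# Hypothetical child: the expert is unreliable, so asserting present is cheapest.
child = nodewise_bayes_action(p_present=0.58, p_expert_wrong=0.40, lam=0.10)
```

With these made-up inputs the parent is deferred while the child is asserted present, which is the delegation-violation pattern named in the abstract. The child probability (0.58) does not exceed the parent's (0.60), so the label marginals themselves are taxonomically consistent; the incoherence comes only from deciding each node separately.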
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies Learning to Defer (L2D) in hierarchical multi-label settings for medical imaging. It introduces a Selective-Exclusion handoff contract to define coherent deferral actions, characterises the Bayes-optimal coherent deferral rule, shows that nodewise Bayes L2D can produce action-incoherent decisions (taxonomic contradictions, delegation violations), and proposes two remedies: an exact dynamic-programming projection decoder over the coherent action set, and Taxonomic Belief Propagation (TBP) combined with Recursive Policy Optimisation (RPO), which trains a joint action model using the same recursion. On real-reader and controlled-expert medical-imaging benchmarks, naive binary-relevance L2D shows non-trivial incoherence; projection eliminates it exactly, while TBP+RPO reduces it to near zero with retained utility.
Significance. If the Selective-Exclusion contract is appropriate for clinical delegation, the work supplies a clean theoretical characterisation of coherence in hierarchical L2D together with both exact and scalable algorithmic remedies whose fixed-point coherence is shown by construction. The recursive structure shared between training and inference is a technical strength. The empirical demonstration that even Bayes-optimal nodewise decisions can be incoherent supplies a concrete motivation for the framework.
major comments (1)
- [Abstract / Formalisation section] The Selective-Exclusion handoff contract is load-bearing for the definition of coherence, the Bayes-optimal characterisation, the proof that nodewise L2D can be incoherent, and both proposed algorithms. The manuscript derives all results inside this contract but provides no external validation (clinician studies, workflow analysis, or sensitivity checks against alternatives such as mandatory ancestor deferral) that the contract matches real medical-imaging delegation semantics. If clinicians require different consistency rules, the incoherence metric, optimality claim, and reported utility retention become inapplicable.
minor comments (1)
- [Empirical evaluation] The abstract refers to 'real-reader and controlled-expert medical-imaging benchmarks' without naming the datasets, reporting sample sizes, or describing the expert simulation protocol; the full paper should include these details plus ablations isolating the contribution of TBP versus RPO.
Simulated Author's Rebuttal
We thank the referee for highlighting the foundational role of the Selective-Exclusion handoff contract and for the constructive suggestion regarding external validation. We address the comment below and outline targeted revisions to improve transparency and robustness.
Point-by-point responses
-
Referee: [Abstract / Formalisation section] The Selective-Exclusion handoff contract is load-bearing for the definition of coherence, the Bayes-optimal characterisation, the proof that nodewise L2D can be incoherent, and both proposed algorithms. The manuscript derives all results inside this contract but provides no external validation (clinician studies, workflow analysis, or sensitivity checks against alternatives such as mandatory ancestor deferral) that the contract matches real medical-imaging delegation semantics. If clinicians require different consistency rules, the incoherence metric, optimality claim, and reported utility retention become inapplicable.
Authors: We agree that the Selective-Exclusion handoff contract is central to the framework, as it precisely defines the feasible set of coherent deferral actions (allowing a model to assert some findings while selectively deferring others without creating taxonomic contradictions or delegation violations). The contract is motivated directly by the hierarchical structure of clinical taxonomies in medical imaging, where delegation to an expert on a parent node does not necessarily require deferring all descendants. We acknowledge that the manuscript provides no new clinician studies, workflow analyses, or explicit sensitivity checks against alternatives such as mandatory ancestor deferral. In the revision we will add a dedicated discussion subsection that: (i) elaborates the clinical motivation for Selective-Exclusion with references to related medical decision-support literature; (ii) performs a sensitivity analysis on the existing benchmarks by recomputing incoherence and utility under a mandatory-ancestor-deferral variant of the contract; and (iii) explicitly states the limitation and positions external clinician validation as an important avenue for future work. These changes make the modelling assumptions more transparent and demonstrate robustness without overstating empirical scope.
Revision: partial
Circularity Check
No significant circularity; derivation self-contained under explicit modeling contract
The paper introduces the Selective-Exclusion handoff contract as an explicit modeling assumption that defines the semantics of coherent deferral actions. Within this contract it derives the Bayes-optimal rule, demonstrates that nodewise L2D can violate coherence, and constructs two enforcement procedures (exact DP projection and TBP+RPO recursion) whose outputs are guaranteed coherent by operating inside the same action set and recursion. This is standard formal modeling rather than a circular reduction: the coherence property is not smuggled in via self-citation or fitted to the same data used for evaluation; it is the direct consequence of the chosen contract and the optimization that respects it. No equations in the abstract reduce a claimed prediction to an input quantity by construction, and the empirical benchmarks compare methods under identical contract assumptions, supplying independent content.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-node deferral thresholds
axioms (2)
- domain assumption: The label taxonomy is a fixed, known DAG supplied at training time.
- domain assumption: Expert labels are available for the subset of cases the model defers.
invented entities (2)
- Selective-Exclusion handoff contract: no independent evidence
- Taxonomic Belief Propagation (TBP) with Recursive Policy Optimisation (RPO): no independent evidence
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation, washburn_uniqueness_aczel (J(x) = ½(x + x⁻¹) − 1)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We formalise coherent hierarchical deferral under a Selective-Exclusion handoff contract, characterise the Bayes-optimal coherent deferral rule... ρ_v(0|x) = w_v P(Y_v=1|x), ρ_v(1|x) = w_v P(Y_v=0|x), ρ_v(⊥|x) = w_v {P(M_v≠Y_v|x) + λ_v}.
- Foundation.Atomicity (topoSort over tree precedence), topoSort_respects
  Relation between the paper passage and the cited Recognition theorem: unclear.
  On a tree, (6) is solved by dynamic programming... F_v(i) = max_{j∈Γ_SE(i)} [log η_v(j|x) + Σ_{u∈C(v)} F_u(j)]. A single decode costs O(|V||A|^2), effectively linear since |A|=3.
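The recursion F_v(i) quoted above can be sketched in Python as a memoised tree decoder. This is a hedged illustration rather than the authors' implementation: the feasible sets in `GAMMA_SE` are an assumption reconstructed from the defect definitions quoted elsewhere on this page (an absent parent forces absent children; a deferred parent forbids present children), the root is treated as having a virtual "present" parent so that all three actions are allowed, and the label names and scores are made up.

```python
import math

ABSENT, PRESENT, DEFER = 0, 1, 2  # DEFER plays the role of ⊥

# Assumed Gamma_SE(i): child actions allowed for each parent action i.
GAMMA_SE = {
    ABSENT: (ABSENT,),
    PRESENT: (ABSENT, PRESENT, DEFER),
    DEFER: (ABSENT, DEFER),
}


def coherent_decode(children, root, log_eta):
    """Return the highest-scoring coherent action per node on a tree.

    children : dict mapping a node to a list of its child nodes
    log_eta  : dict mapping a node to [log eta(ABSENT), log eta(PRESENT), log eta(DEFER)]
    """
    memo = {}

    def best(v, parent_action):
        """Return (F_v(parent_action), maximising action j) for the subtree at v."""
        key = (v, parent_action)
        if key not in memo:
            memo[key] = max(
                (log_eta[v][j] + sum(best(u, j)[0] for u in children.get(v, ())), j)
                for j in GAMMA_SE[parent_action]
            )
        return memo[key]

    actions = {}
    stack = [(root, PRESENT)]  # virtual parent leaves the root unconstrained
    while stack:
        v, parent_action = stack.pop()
        actions[v] = best(v, parent_action)[1]
        stack.extend((u, actions[v]) for u in children.get(v, ()))
    return actions


# Hypothetical three-node taxonomy with made-up action scores.
children = {"opacity": ["edema", "consolidation"]}
log_eta = {
    "opacity": [math.log(p) for p in (0.2, 0.3, 0.5)],
    "edema": [math.log(p) for p in (0.3, 0.6, 0.1)],
    "consolidation": [math.log(p) for p in (0.7, 0.2, 0.1)],
}
nodewise = {v: max(range(3), key=lambda j: log_eta[v][j]) for v in log_eta}
decoded = coherent_decode(children, "opacity", log_eta)
```

In this toy case the per-node argmax defers the parent while asserting "edema" present, whereas the decoder asserts the parent present because that joint assignment has the higher total score among coherent ones. Each (node, parent action) pair is evaluated once over at most three candidate actions, which matches the quoted O(|V||A|^2) cost.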
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bibb Allen, Sheela Agarwal, Laura Coombs, Christoph Wald, and Keith Dreyer. 2021. 2020 ACR Data Science Institute artificial intelligence survey. Journal of the American College of Radiology 18, 8 (2021), 1153–1159.
2021
-
[2]
Jean Vieira Alves, Diogo Leitão, Sérgio Jesus, Marco O. P. Sampaio, Javier Liébana, Pedro Saleiro, Mario A. T. Figueiredo, and Pedro Bizarro. 2024. Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=TAvGZm2Rqb
2024
-
[3]
Christoph D. Becker, Elmar Kotter, Laure Fournier, and Luis Martí-Bonmatí. 2022. Current practical experience with artificial intelligence in clinical radiology: a survey of the European Society of Radiology. Insights into Imaging 13, 1 (June 2022). doi:10.1186/s13244-022-01247-y
2022
-
[4]
L Bos and K Donnelly. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud Health Technol Inform 121 (2006), 279–290.
2006
-
[5]
Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. 2020. PadChest: A large chest x-ray image dataset with multi-label annotated reports. Medical Image Analysis 66 (2020), 101797. doi:10.1016/j.media.2020.101797
-
[6]
Haomin Chen, Shun Miao, Daguang Xu, Gregory D Hager, and Adam P Harrison. 2019. Deep hierarchical multi-label classification of chest X-ray images. In International conference on medical imaging with deep learning. PMLR, 109–120.
2019
-
[7]
C. Chow. 1970. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16, 1 (1970), 41–46. doi:10.1109/TIT.1970.1054406
-
[8]
C. K. Chow. 1957. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers EC-6, 4 (1957), 247–254. doi:10.1109/TEC.1957.5222035
-
[9]
Ran El-Yaniv et al. 2010. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research 11, 5 (2010).
2010
-
[10]
Fatma A Eltawil, Michael Atalla, Emily Boulos, Afsaneh Amirabadi, and Pascal N Tyrrell
-
[11]
Analyzing barriers and enablers for the acceptance of artificial intelligence innovations into radiology practice: a scoping review. Tomography 9, 4 (2023), 1443–1455.
2023
-
[12]
Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. Advances in neural information processing systems 30 (2017).
2017
-
[13]
Yonatan Geifman and Ran El-Yaniv. 2019. Selectivenet: A deep neural network with an integrated reject option. In International conference on machine learning. PMLR, 2151–2159.
2019
-
[14]
Eleonora Giunchiglia and Thomas Lukasiewicz. 2020. Coherent hierarchical multi-label classification networks. Advances in neural information processing systems 33 (2020), 9662–9673.
2020
-
[15]
Shani Goren, Ido Galil, and Ran El-Yaniv. 2024. Hierarchical selective classification. Advances in Neural Information Processing Systems 37 (2024), 111047–111073.
2024
-
[16]
Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597.
2019
-
[17]
Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, and S Kevin Zhou. 2024. Carzero: Cross-attention alignment for radiology zero-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11137–11146.
2024
-
[18]
Curtis P Langlotz. 2006. RadLex: a new method for indexing online educational materials. Radiographics 26, 6 (Nov. 2006), 1595–1597.
2006
-
[19]
David Madras, Toni Pitassi, and Richard Zemel. 2018. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in neural information processing systems 31 (2018).
2018
-
[20]
Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. 2023. Two-stage learning to defer with multiple experts. Advances in neural information processing systems 36 (2023), 3578–3606.
2023
-
[21]
Yannis Montreuil, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. 2025. Why Ask One When You Can Ask 𝑘? Learning-to-Defer to the Top-𝑘 Experts. arXiv:2504.12988 [cs.LG] https://arxiv.org/abs/2504.12988
2025
- [22]
-
[23]
Hussein Mozannar, Hunter Lang, Dennis Wei, Prasanna Sattigeri, Subhro Das, and David Sontag. 2023. Who should predict? exact algorithms for learning to defer to humans. In International conference on artificial intelligence and statistics. PMLR, 10520–10545
2023
-
[24]
Hussein Mozannar and David Sontag. 2020. Consistent Estimators for Learning to Defer to an Expert. In International Conference on Machine Learning. PMLR, 7076–7087.
2020
-
[25]
Harikrishna Narasimhan, Wittawat Jitkrittum, Aditya K Menon, Ankit Rawat, and Sanjiv Kumar. 2022. Post-hoc estimators for learning to defer to an expert. Advances in Neural Information Processing Systems 35 (2022), 29292–29304.
2022
-
[26]
Cuong C Nguyen, Thanh-Toan Do, and Gustavo Carneiro. 2025. Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution. In The Thirteenth International Conference on Learning Representations.
2025
-
[27]
Ha Quy Nguyen, Hieu Huy Pham, le tuan linh, Minh Dao, and lam khanh. 2021. VinDr-CXR: An open dataset of chest X-rays with radiologist annotations. PhysioNet (June 2021). doi:10.13026/3akn-b287 Version 1.0.0
-
[28]
Andrew Ponomarev. 2024. A Simple Heuristic for Controlling Human Workload in Learning to Defer. In International Conference on Pattern Recognition. Springer, 120–130.
2024
-
[29]
Joshua Strong, Emma Sun, Harry Rogers, Helen Higham, and Alison Noble. 2025. Learning to defer: A survey. (Dec. 2025)
2025
-
[30]
Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar
-
[31]
Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature biomedical engineering 6, 12 (2022), 1399–1406.
2022
-
[32]
Rajeev Verma, Daniel Barrejón, and Eric Nalisnick. 2023. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics. PMLR, 11415–11434
2023
-
[33]
Zhiyuan Yang, Kai Li, Sophia Ghamoshi Ramandi, Patricia Brassard, Hakim Khellaf, Vincent Quoc-Huy Trinh, Jennifer Zhang, Lina Chen, Corwyn Rowsell, Sonal Varma, et al. 2025. Adpv2: A hierarchical histological tissue type-annotated dataset for potential biomarker discovery of colorectal disease. Journal of Pathology Informatics (2025), 100537.
2025
-
[34]
Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. 2018. Binary relevance for multi-label learning: an overview. Frontiers of Computer Science 12, 2 (2018), 191–202.
2018
-
[35]
Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2023. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications 14, 1 (2023), 4542.
Contents of Appendix: 7 Extended Literature Review; 8 An Informal Walk-Through of Deferral Coherence; 9 Alternative Deferral Semantics: Strong Subtree Handoff; 9.1 ...
2023
-
[36]
Could the model’s assertions all be true together? If not, we have a taxonomic contradiction.
-
[37]
Is the model deferring a parent while still making a child-present claim that already commits the parent? If yes, we have a delegation violation.
-
[38]
the expert should decide the parent
Is the model deferring a node whose value is already implied by its own earlier assertions? If yes, we have a deductive defect.
Taxonomic contradiction: saying something impossible. The familiar bad case is parent absent, child present. For example, Lung Opacity = absent and Edema = present cannot both be true. Another way to say this is that, after leaving deferred nodes und...
-
[39]
do not assert an impossible parent–child combination
-
[40]
do not defer a question while answering it indirectly through a descendant
-
[41]
do not defer a label that your own assertions have already determined.
Section 3 makes these rules precise, and Section 4 then builds estimators whose predictions respect this coherent decision space.
9 Alternative Deferral Semantics: Strong Subtree Handoff
Our main method adopts the Selective-Exclusion contract (Definition 3), which allows selectively excl...
-
[42]
If the parent is asserted absent, then the child is forced absent: $P(a_v = 1 \mid a_{\mathrm{pa}(v)} = 0, x) = P(a_v = \bot \mid a_{\mathrm{pa}(v)} = 0, x) = 0$.
-
[43]
If the parent is deferred, the child cannot be asserted present: $P(a_v = 1 \mid a_{\mathrm{pa}(v)} = \bot, x) = 0$.
Proof. Both claims follow immediately from the corresponding rows of the transition matrix (18). The first row is $[1, 0, 0]$, and the third row is $[\alpha_v(x), 0, 1 - \alpha_v(x)]$.
Proposition 5 (Coherent handoff preserves label consistency). Let the completed system label be $\hat{y}^{\mathrm{sys}}_v$(𝑎, ...
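The two rows quoted in the proof can be checked numerically with a small Python sketch. Rows are indexed by the parent action and columns by the child action, both in the order (0, 1, ⊥). The middle row (parent asserted present) is not quoted on this page, so it is passed in by the caller as a hypothetical placeholder; the function names and the numbers are likewise assumptions. The child marginal is obtained with the law of total probability, not with any routine taken from the paper.

```python
ABSENT, PRESENT, DEFER = 0, 1, 2  # DEFER plays the role of ⊥


def transition_matrix(alpha_v, present_row):
    """Build P(child action | parent action) with the two quoted rows fixed.

    alpha_v     : value of alpha_v(x), the probability of 'absent' under a deferred parent
    present_row : hypothetical row for a present parent (not given in the source)
    """
    if abs(sum(present_row) - 1.0) > 1e-9:
        raise ValueError("present_row must sum to 1")
    return [
        [1.0, 0.0, 0.0],                # parent absent: child forced absent
        list(present_row),              # parent present: caller-supplied assumption
        [alpha_v, 0.0, 1.0 - alpha_v],  # parent deferred: child absent or deferred
    ]


def child_marginal(parent_marginal, T):
    """Law of total probability: sum over parent actions of P(parent) * P(child | parent)."""
    return [sum(parent_marginal[i] * T[i][j] for i in range(3)) for j in range(3)]


T = transition_matrix(alpha_v=0.25, present_row=(0.2, 0.5, 0.3))
after_absent = child_marginal([1.0, 0.0, 0.0], T)  # parent surely absent
after_defer = child_marginal([0.0, 0.0, 1.0], T)   # parent surely deferred
```

Whatever middle row is supplied, a parent that is certainly absent leaves no mass on the child's present or deferred actions, and a parent that is certainly deferred leaves no mass on the child's present action, which is what the two claims state.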
-
[44]
Contract-induced non-separability (gating). Under a deferral contract (Selective-Exclusion), internal-node actions change the feasible action space of descendants. Therefore, the Bayes-optimal coherent policy is generally not separable across nodes, whereas many Stage I objectives are locally trained (e.g., BR-style L2D, per-node surrogates, or masked classification losses).
-
[45]
predict vs. defer
Composition/semantics mismatch. Stage I may optimise predictions under a hierarchy semantics that differs from TBP’s contract-constrained ternary composition. Even when Stage I is hierarchy-aware (e.g., constraint/closure-based HMLC surrogates such as MCLoss), the resulting parameters need not be stationary for the TBP-induced marginals that are used at infe...
-
[46]
taxonomic contradiction: parent absent and any child present
-
[47]
delegation violation: parent defer and any child present
-
[48]
deductive defect: parent absent, no child present, and any child defer
-
[49]
coherent: all remaining neighbourhoods.
Finally, $\text{AU-Neigh-Any} = \int_0^1 R_{\mathrm{any}}(b)\,db$, and similarly for the defect-specific neighbourhood-partition rates used throughout the paper.
Edge-weighted incoherence. In addition to the neighbourhood partition, we also compute an edge-weighted view in which the unit of analysis is the immediate parent–child edge. Let $\mathcal{E}_{\mathrm{kept}}$...
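The four-way neighbourhood partition listed above (taxonomic contradiction, delegation violation, deductive defect, coherent) can be sketched as a single Python function over one parent and its immediate children. The function name, the integer action encoding, and the returned strings are assumptions made for illustration; only the three defect conditions are taken from the text.

```python
ABSENT, PRESENT, DEFER = 0, 1, 2  # DEFER plays the role of ⊥


def classify_neighbourhood(parent_action, child_actions):
    """Assign one parent-children neighbourhood to a cell of the partition."""
    any_child_present = PRESENT in child_actions
    any_child_deferred = DEFER in child_actions

    if parent_action == ABSENT and any_child_present:
        return "taxonomic contradiction"
    if parent_action == DEFER and any_child_present:
        return "delegation violation"
    # Reaching this point with an absent parent means no child is present,
    # so a deferred child is a label the model's own assertion already fixed.
    if parent_action == ABSENT and any_child_deferred:
        return "deductive defect"
    return "coherent"


examples = {
    "absent parent, present child": classify_neighbourhood(ABSENT, [PRESENT, ABSENT]),
    "deferred parent, present child": classify_neighbourhood(DEFER, [PRESENT]),
    "absent parent, deferred child": classify_neighbourhood(ABSENT, [ABSENT, DEFER]),
    "deferred parent, deferred child": classify_neighbourhood(DEFER, [DEFER, ABSENT]),
}
```

Checking the conditions in this order makes the cells mutually exclusive, since the deductive-defect test is only reached when no child is asserted present.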
-
[50]
deferral coherence is still defined over ternary handoff actions rather than binary labels
-
[51]
Selective-Exclusion still forbids positive child assertions under absent or deferred parents
-
[52]
coherent projection and coherent-support joint action models can still be formulated exactly. What changes is the inference problem. Trees permit simple linear-time dynamic programs. DAGs require either exact bounded-treewidth inference, integer-programming decoders, or approximate structured inference. Thus, extending the present framework to realistic D...