pith. machine review for the scientific record. sign in

arxiv: 2604.13285 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Recognition: unknown

L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords clinical text classificationlearning to deferBERTlarge language modelsadaptive model selectionadverse drug event detectionMIMICmodel complementarity
0
0 comments X

The pith

L2D-Clinical learns when a BERT classifier should hand clinical texts to an LLM, lifting F1 by up to 9 points while routing only 7-17 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents L2D-Clinical, a method that trains a separate model to decide, for each input, whether the output of a fine-tuned BERT should be used or whether the instance should be sent to a large language model instead. The decision rests on uncertainty scores produced by the BERT model together with surface features of the text. On adverse-drug-event detection the combined system reaches 0.928 F1 by deferring 7 percent of cases, a gain of 1.7 points over the best BERT. On treatment-outcome classification from MIMIC records the system reaches 0.980 F1 by deferring 16.8 percent of cases, a gain of 9.3 points over the best BERT. The central demonstration is that the two model families are complementary on different subsets of the data and that a lightweight policy can exploit the complementarity without paying the cost of the LLM on every example.

Core claim

L2D-Clinical trains a deferral policy that identifies the instances on which an LLM will outperform a BERT classifier. In the ADE task BioBERT alone achieves 0.911 F1 while the LLM achieves 0.765 F1; the deferral system reaches 0.928 F1 by sending 7 percent of instances to the LLM. In the MIMIC task ClinicalBERT alone achieves 0.887 F1 while GPT-5-nano achieves 0.967 F1; the deferral system reaches 0.980 F1 by sending 16.8 percent of instances to the LLM. The policy is learned from BERT uncertainty signals and text characteristics so that it improves accuracy precisely when the LLM supplies the complementary strength.

What carries the argument

The deferral classifier, which takes BERT uncertainty estimates and text features as input and outputs a binary decision to use the BERT prediction or to query the LLM.

If this is right

  • Overall F1 rises without requiring the LLM on every input.
  • The method works even when the BERT model is stronger on the majority of cases.
  • API cost is limited because only a small fraction of instances trigger an LLM call.
  • The same deferral idea can be applied to any pair of models whose strengths are complementary rather than one being uniformly superior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the deferral policy generalizes across hospitals or note formats, health systems could maintain a single lightweight selector instead of choosing one model family for all future data.
  • The approach suggests a broader pattern: pair a fast specialized model with a slower general model and learn the switch point rather than committing to one permanently.
  • Retraining the deferral policy on new data would be cheap compared with retraining either the BERT or the LLM.

Load-bearing premise

Signals from the BERT model and properties of the input text are sufficient to predict which future instances the LLM will classify more accurately than BERT.

What would settle it

Apply the trained deferral policy to a new collection of clinical notes from a different institution and check whether the instances it routes to the LLM actually show higher accuracy for the LLM than for BERT on that same set.

read the original abstract

Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral-improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM's high recall compensates for BERT's misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8\% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces L2D-Clinical, a learning-to-defer framework for clinical text classification that trains a policy to decide when a fine-tuned BERT model should defer to an LLM. The policy uses uncertainty signals (e.g., entropy or confidence) and text characteristics as features. On the ADE detection task (ADE Corpus V2), BioBERT achieves F1=0.911 while the LLM reaches 0.765; L2D-Clinical reaches F1=0.928 by deferring 7% of instances. On MIMIC-IV treatment outcome classification, ClinicalBERT achieves F1=0.887 while GPT-5-nano reaches 0.967; L2D-Clinical reaches F1=0.980 by deferring 16.8% of cases. The approach aims to exploit model complementarity while controlling LLM API costs.

Significance. If the deferral policy generalizes reliably, the work could offer moderate practical value for clinical NLP deployments by enabling selective use of expensive LLMs only when they complement specialized models. This differs from prior L2D literature that assumes a human expert is universally superior. The reported gains (+1.7 F1 at low deferral on ADE; +9.3 F1 at moderate deferral on MIMIC) suggest potential efficiency benefits, but only if the policy avoids overfitting to training-set patterns and delivers consistent improvements on held-out clinical text.

major comments (3)
  1. [Methods] Methods section: The deferral classifier's training procedure, feature definitions (exact uncertainty signals and text characteristics), labeling strategy (which model is correct on training instances), and architecture are not described. This is load-bearing for the central claim, as the reported F1 gains depend on the policy correctly identifying instances where the LLM outperforms BERT on unseen data rather than memorizing training patterns.
  2. [Results] Results section: No ablation studies on individual features (uncertainty vs. text characteristics), no out-of-distribution testing, and no analysis of when the deferral policy fails are provided. Without these, it is impossible to confirm that the +1.7 F1 (ADE, 7% deferral) and +9.3 F1 (MIMIC, 16.8% deferral) improvements stem from learned complementarity rather than dataset-specific correlations.
  3. [Abstract] Abstract and evaluation: Data splits, error bars, confidence intervals, and statistical significance tests for the F1 scores are absent. The soundness assessment notes that full methods and splits are missing, which directly affects confidence in the concrete improvements claimed (F1=0.928 on ADE; F1=0.980 on MIMIC).
minor comments (2)
  1. [Abstract] Clarify the exact LLM variant (referred to as GPT-5-nano) and whether multi-LLM consensus ground truth on MIMIC introduces any label noise that could affect deferral training.
  2. A diagram illustrating the L2D-Clinical pipeline (BERT inference, feature extraction, deferral decision, LLM call) would improve readability of the framework.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on reproducibility, rigorous evaluation, and validation of the core claims. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details and analyses.

read point-by-point responses
  1. Referee: [Methods] Methods section: The deferral classifier's training procedure, feature definitions (exact uncertainty signals and text characteristics), labeling strategy (which model is correct on training instances), and architecture are not described. This is load-bearing for the central claim, as the reported F1 gains depend on the policy correctly identifying instances where the LLM outperforms BERT on unseen data rather than memorizing training patterns.

    Authors: We agree that the Methods section in the submitted version lacks sufficient detail. In the revised manuscript, we will expand it to fully describe: (1) the exact uncertainty signals (entropy, maximum softmax probability, and prediction margin) and text characteristics (sequence length, type-token ratio, and presence of clinical keywords); (2) the labeling strategy, where each training instance is labeled by comparing BERT and LLM predictions against ground truth to identify which model is correct; (3) the deferral policy architecture (a gradient-boosted decision tree trained on the combined features); and (4) the full training procedure with hyperparameters. This will clarify how the policy learns complementarity rather than overfitting to training patterns. revision: yes

  2. Referee: [Results] Results section: No ablation studies on individual features (uncertainty vs. text characteristics), no out-of-distribution testing, and no analysis of when the deferral policy fails are provided. Without these, it is impossible to confirm that the +1.7 F1 (ADE, 7% deferral) and +9.3 F1 (MIMIC, 16.8% deferral) improvements stem from learned complementarity rather than dataset-specific correlations.

    Authors: We acknowledge these gaps and will strengthen the Results section. We will add ablation experiments that isolate uncertainty signals versus text characteristics to quantify each component's contribution. We will also include an out-of-distribution evaluation on an additional held-out clinical corpus and a dedicated failure analysis subsection examining cases where the policy defers incorrectly or fails to defer when beneficial. These additions will provide direct evidence that gains arise from learned model complementarity. revision: yes

  3. Referee: [Abstract] Abstract and evaluation: Data splits, error bars, confidence intervals, and statistical significance tests for the F1 scores are absent. The soundness assessment notes that full methods and splits are missing, which directly affects confidence in the concrete improvements claimed (F1=0.928 on ADE; F1=0.980 on MIMIC).

    Authors: We will update the Abstract to specify the data splits (80/10/10 train/validation/test) and report F1 scores with standard deviations across multiple runs. In the Experiments section, we will add bootstrapped confidence intervals and statistical significance tests (paired bootstrap test) comparing L2D-Clinical against the BERT and LLM baselines. These changes will directly address concerns about the reliability of the reported +1.7 and +9.3 F1 improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents L2D-Clinical as an empirical framework that trains a deferral classifier on uncertainty signals and text characteristics, with labels obtained from ground-truth comparisons of BERT vs. LLM performance on training instances. Reported F1 gains (+1.7 on ADE at 7% deferral; +9.3 on MIMIC at 16.8%) are measured on held-out test sets against fixed external baselines, without any equations, fitted parameters, or self-citations that reduce the improvements to definitional equivalence or imported uniqueness results. The derivation chain consists of standard supervised learning for deferral followed by direct performance evaluation, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that BERT and LLM exhibit complementary error patterns that can be predicted from uncertainty and text features; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption BERT and LLM have complementary strengths on clinical instances that uncertainty signals can detect
    Stated in the abstract as the basis for selective deferral improving accuracy

pith-pipeline@v0.9.0 · 5567 in / 1176 out tokens · 32928 ms · 2026-05-10T15:09:20.535659+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

    Introduction Clinical text classification is fundamental to health- care NLP applications, from adverse drug event (ADE) detection for pharmacovigilance (Henry et al., 2020; Karimi et al., 2015) to treatment out- come extraction for clinical decision support (Xu and Wang, 2022). The choice between specialized fine-tunedmodels(e.g.,BioBERT(Leeetal.,2020), ...

  2. [2]

    For example, our LLM achieves only F1=0.765 on ADE detection vs

    AI-to-AI deferral: Unlike prior L2D work that defers to human experts, we study deferral between AI systems (BERT→ LLM), where the “expert” is not universally superior. For example, our LLM achieves only F1=0.765 on ADE detection vs. BERT’s 0.911, yet selective deferral still improves overall performance

  3. [3]

    Adaptivedeferralbehavior: Wedemonstrate that L2D-Clinical adapts to different model dynamics—improving accuracy when deferral helps (ADE: +1.7 F1 points, MIMIC: +9.3 F1 points) while minimizing LLM usage (7% and 16.8% deferral rates respectively)

  4. [4]

    Human expert validation on a 5% subset con- firmed 100% label accuracy

    Consensus ground truth for MIMIC: We introduce a multi-LLM consensus labeling methodology (GPT-5.2 + Claude 4.5 agree- ment) that produces higher-quality ground truth than single-model annotation, with 83.8% agreement rate yielding 2,782 reliable labels. Human expert validation on a 5% subset con- firmed 100% label accuracy

  5. [5]

    Clinical notes at scale: Experiments on real MIMIC-IV discharge summaries (279 test sam- ples)alongside the ADEbenchmark(500sam- ples), demonstrating applicability to actual EHR data

  6. [6]

    Interpretable patterns: Analysis reveals task- specific features (e.g., causal language, out- come keywords) that predict when each model excels, supporting clinical validation

  7. [7]

    BERT-likemodels excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics

    Related Work Our work draws on three research threads: clinical text classification methods, the learning to defer framework, and hybrid NLP systems that combine multiple models for efficient inference. 2.1. Clinical Text Classification The extraction of ADEs from clinical text has been extensively studied through shared tasks and benchmark datasets. The ...

  8. [8]

    Symptoms improved after switching to lisinopril,

    Method Building on these foundations, we develop a lightweight deferral model that routes inputs be- tween BERT and the LLM. The workflow proceeds as follows: (1) BERT classifies the input text and produces softmax probabilities; (2) the deferral model examines these probabilities along with text characteristics to predict whether BERT is likely wrong; (3...

  9. [9]

    You classify sentences for adverse drug events. Reply with only 1 (contains ADE) or 0 (no ADE)

    Experimental Setup Having established the L2D-Clinical framework, we now describe the experimental design used to eval- uateitseffectivenessacrossdifferentmodeldynam- ics. The ADE Corpus V2 represents a scenario where the domain-adapted BERT model outper- forms the LLM, while MIMIC-IV treatment outcome classification represents the opposite, where the LLM...

  10. [10]

    Patient noted some stomach discomfort which may be related to the medica- tion

    Results We present results on both tasks, beginning with ADE detection where BioBERT substantially out- performs the LLM (F1: 0.911 vs. 0.765), followed by treatment outcome classification where GPT- 5-nano outperforms ClinicalBERT (F1: 0.967 vs. 0.887). In both cases, L2D-Clinical improves over both individual models by learning when to defer. 5.1. Task ...

  11. [11]

    Discussion The experimental results raise a natural question: why does L2D-Clinical succeed in improving accu- racy even when deferring to a weaker model? We now analyze the underlying mechanisms and dis- cuss practical implications for clinical deployment. 6.1. Why L2D-Clinical Works The central finding, that L2D-Clinical adapts its de- ferralbehaviortom...

  12. [12]

    Conclusion We have presented Learning to Defer for clinical text (L2D-Clinical), demonstrating that learned de- ferral adapts to different model dynamics. On ADE detection, where BioBERT outperforms the LLM but complementary error patterns exist, L2D- Clinical achieves F1=0.928 (+1.7 points over BERT alone) by selectively deferring 7% of instances. On MIM...

  13. [13]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Frugalgpt: How to use large language models while reducing cost and improving per- formance. InarXiv preprint arXiv:2305.05176. Harsha Gurulingappa, Abdul Mateen Rajput, An- gus Roberts, Juliane Fluck, Martin Hofmann- Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from m...

  14. [14]

    The algorithmic automation problem: Prediction, triage, and human effort

    Predict responsibly: Improving fairness and accuracy by learning to defer. InAdvances in Neural Information Processing Systems, vol- ume 31. Hussein Mozannar and David Sontag. 2020. Con- sistent estimators for learning to defer to an ex- pert. InInternational Conference on Machine Learning, pages 7076–7087. PMLR. HusseinMozannarandDavidSontag.2023. Teach-...

  15. [15]

    InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651

    The right tool for the job: Matching model and instance complexities. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651. Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, NoahASmith,andYejinChoi.2020. Datasetcar- tography: Mapping and diagnosing datasets with trai...

  16. [16]

    We discuss the limitations for our infrastructure along with its latency and how we plan on presenting more ideas in the future

    Appendix Sections in this appendix provide more insight into the work completed. We discuss the limitations for our infrastructure along with its latency and how we plan on presenting more ideas in the future. 8.1. Limitations Several limitations of this work should be acknowl- edged. First, our test set sizes are moderate (500 and 279 samples) due to LLM...