Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
Pith reviewed 2026-05-16 16:22 UTC · model grok-4.3
The pith
CAPT adapts new general-domain LLMs to clinical tasks using legacy models without any retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAPT is a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. It supports models with disjoint vocabularies by leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches.
What carries the argument
Contrastive decoding applied to cross-architecture proxy tuning, which selectively amplifies tokens preferred by the clinical model over the general model.
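The scoring rule quoted further down this page, s(i) = log p_new(i | x_1:t) + α (log p_old-clin(f(i) | x_1:t) − log p_old(f(i) | x_1:t)), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `capt_scores` and `softmax`, the dictionary representation of next-token log-probabilities, the identity mapping used for `f`, and the zero-offset fallback for unmapped tokens are all hypothetical. Reading p_old as the legacy general-domain counterpart of the clinical model is also an assumption.

```python
import math

def capt_scores(logp_new, logp_old_clin, logp_old, f, alpha=1.0):
    """Score each new-model token i as
    logp_new[i] + alpha * (logp_old_clin[f(i)] - logp_old[f(i)])."""
    scores = {}
    for token, lp in logp_new.items():
        mapped = f(token)  # f maps a new-vocabulary token into the legacy vocabulary
        if mapped in logp_old_clin and mapped in logp_old:
            offset = logp_old_clin[mapped] - logp_old[mapped]
        else:
            # Assumption: a token with no legacy counterpart gets no adjustment.
            offset = 0.0
        scores[token] = lp + alpha * offset
    return scores

def softmax(scores):
    """Normalize scores into a probability distribution."""
    top = max(scores.values())
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Toy next-token distributions (made-up numbers, shared vocabulary for brevity).
logp_new = {"heart": math.log(0.5), "MI": math.log(0.2), "the": math.log(0.3)}
logp_old_clin = {"heart": math.log(0.3), "MI": math.log(0.6), "the": math.log(0.1)}
logp_old = {"heart": math.log(0.4), "MI": math.log(0.1), "the": math.log(0.5)}

probs = softmax(capt_scores(logp_new, logp_old_clin, logp_old, f=lambda t: t))
print(max(probs, key=probs.get))
```

In this toy case the clinical model prefers "MI" far more strongly than its non-clinical counterpart does, so the offset lifts "MI" above "heart" even though the new model alone ranks "heart" first. Setting `alpha=0.0` recovers the new model's scores unchanged.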
If this is right
- Healthcare institutions can use the latest general LLMs for clinical work without needing to retrain them on clinical data.
- Performance improves on both classification and text generation tasks in medicine.
- CAPT reduces context errors and increases clinical specificity in outputs.
- The method benefits institutions with limited computing resources that cannot afford repeated clinical retraining.
Where Pith is reading between the lines
- Similar proxy tuning could apply to other specialized domains like law or finance using legacy models.
- Future work might test if CAPT scales to even newer model generations or larger clinical datasets.
- Physician case studies suggest it could lead to more reliable AI assistants in real medical settings.
Load-bearing premise
That contrastive decoding can selectively inject clinically relevant signals from the legacy model while preserving the general-domain model's reasoning and fluency without introducing new errors or degrading performance.
What would settle it
Observing that the combined model produces more clinical errors or lower accuracy than the new general model alone on additional medical tasks would show the selective injection is not working as claimed.
Original abstract
Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity. This technique especially benefits healthcare institutions with constrained computational capacity that cannot support iterative clinical training and want to adopt emerging general-domain model advances.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cross-Architecture Proxy Tuning (CAPT), a training-free ensembling method that adapts new-generation general-domain LLMs to clinical tasks by using contrastive decoding with legacy clinical models, supporting disjoint vocabularies. It reports that CAPT outperforms both base models and SOTA ensembling baselines (UniTE and proxy tuning) on six clinical classification and text-generation tasks, with average gains of +17.6% and +41.4% respectively, supported by token-level analysis and physician case studies showing reduced context errors and higher clinical specificity.
Significance. If the results hold under rigorous validation, the work has clear practical significance for resource-constrained healthcare settings that wish to adopt frontier general-domain models without repeated clinical-domain retraining. The training-free, cross-architecture design is a genuine strength, and the inclusion of physician case studies provides a useful qualitative dimension. Credit is given for focusing on a concrete deployment constraint rather than purely architectural novelty.
major comments (2)
- [Abstract] The headline quantitative claims (+17.6% over UniTE, +41.4% over proxy tuning) are presented without any description of the six tasks, baseline implementations, number of runs, error bars, or statistical tests. This absence makes the central empirical claim impossible to assess for robustness.
- [Token-level analysis and physician case studies] These analyses are offered as evidence that contrastive decoding selectively amplifies clinically relevant signals without introducing new errors. However, the manuscript contains no controlled disagreement set or error-rate measurement on cases where the legacy clinical model and new general-domain model conflict on medical facts (updated guidelines, rare conditions, temporal changes), leaving the core assumption untested.
minor comments (2)
- [Abstract] The term 'state-of-the-art ensembling approaches' should explicitly list all compared methods rather than only naming UniTE and proxy tuning.
- [Methods] The precise mechanism for handling disjoint vocabularies during contrastive decoding (token mapping, logit alignment, etc.) requires a clearer algorithmic description or pseudocode.
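To make the referee's request about disjoint vocabularies concrete, here is one shape such a token mapping could take. This is purely illustrative and is not the paper's mechanism, which this page does not describe: the function name `build_token_map` and the exact-match-then-longest-prefix rule are assumptions chosen only to show what an explicit algorithmic description might look like.

```python
def build_token_map(new_vocab, old_vocab):
    """Map each new-model token string to a legacy-model token string.

    Hypothetical rule: use an exact string match when one exists, otherwise
    fall back to the longest proper prefix found in the legacy vocabulary.
    Tokens with no match map to None.
    """
    old_set = set(old_vocab)
    mapping = {}
    for tok in new_vocab:
        if tok in old_set:
            mapping[tok] = tok
            continue
        match = None
        # Try progressively shorter prefixes of the new token.
        for end in range(len(tok) - 1, 0, -1):
            if tok[:end] in old_set:
                match = tok[:end]
                break
        mapping[tok] = match
    return mapping

# Toy vocabularies (made up for illustration).
new_vocab = ["cardio", "cardiomyopathy", "xyz"]
old_vocab = ["cardio", "card", "my", "opathy"]
token_map = build_token_map(new_vocab, old_vocab)
print(token_map)
```

A real system would also need to decide how to treat tokens that map to None and how to handle many-to-one collisions; those are exactly the details the comment asks the authors to spell out.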
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical value of CAPT for resource-constrained clinical settings. We address each major comment below and will incorporate revisions where they strengthen the manuscript without misrepresenting our existing results.
Point-by-point responses
- Referee: [Abstract] The headline quantitative claims (+17.6% over UniTE, +41.4% over proxy tuning) are presented without any description of the six tasks, baseline implementations, number of runs, error bars, or statistical tests. This absence makes the central empirical claim impossible to assess for robustness.
  Authors: We agree that the abstract would be more informative with additional context. In the revised version we will expand the abstract by one sentence to name the six tasks (three classification tasks on MIMIC-III and three generation tasks on radiology reports), state that all results are averaged over three independent runs with standard deviations reported, and note that gains are statistically significant under paired t-tests (p < 0.05). Full baseline implementations, hyper-parameters, and evaluation protocols already appear in Sections 3 and 4; the abstract change will simply point readers to these details.
  Revision: yes
- Referee: [Token-level analysis and physician case studies] These analyses are offered as evidence that contrastive decoding selectively amplifies clinically relevant signals without introducing new errors. However, the manuscript contains no controlled disagreement set or error-rate measurement on cases where the legacy clinical model and new general-domain model conflict on medical facts (updated guidelines, rare conditions, temporal changes), leaving the core assumption untested.
  Authors: The referee correctly notes that our current qualitative analyses do not include a controlled quantitative study of model disagreements on evolving medical facts. We will add a new subsection (Section 5.4) that constructs a curated disagreement set of 50 cases drawn from updated guidelines and rare conditions, measures per-model and CAPT error rates on this set, and reports the fraction of cases in which CAPT resolves the conflict in favor of the clinically correct answer. This addition will directly test the assumption while remaining within the scope of a major revision.
  Revision: yes
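The disagreement-set measurement promised in the rebuttal amounts to two simple quantities: a per-system error rate and the fraction of conflicts that CAPT resolves correctly. A minimal sketch follows, assuming each case records a gold answer and one answer per system; the `Case` record, its field names, and the toy cases are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Case:
    gold: str          # clinically correct answer
    new_model: str     # answer from the new general-domain model
    legacy_model: str  # answer from the legacy clinical model
    capt: str          # answer from the combined system

def error_rate(cases, field):
    """Fraction of cases where the named system disagrees with the gold answer."""
    return sum(getattr(c, field) != c.gold for c in cases) / len(cases)

def resolution_rate(cases):
    """Among cases where the two base models conflict, the fraction in which
    CAPT gives the gold answer. Returns None if there are no conflicts."""
    conflicts = [c for c in cases if c.new_model != c.legacy_model]
    if not conflicts:
        return None
    return sum(c.capt == c.gold for c in conflicts) / len(conflicts)

# Toy cases (made up for illustration).
cases = [
    Case(gold="A", new_model="A", legacy_model="B", capt="A"),
    Case(gold="B", new_model="A", legacy_model="B", capt="B"),
    Case(gold="C", new_model="C", legacy_model="D", capt="D"),
    Case(gold="E", new_model="E", legacy_model="E", capt="E"),
]

for name in ("new_model", "legacy_model", "capt"):
    print(name, error_rate(cases, name))
print("resolution rate", resolution_rate(cases))
```

Reporting the resolution rate separately from the overall error rate matters because a system can have a low error rate on an easy set while still siding with the wrong model whenever the two base models conflict.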
Circularity Check
No circularity: CAPT is an empirical ensembling proposal validated on held-out tasks
full rationale
The paper presents CAPT as a new contrastive-decoding ensembling technique that combines a general-domain LLM with a legacy clinical model without retraining. All central claims rest on direct empirical measurements across six classification and generation tasks, with explicit comparisons to baselines such as UniTE and proxy tuning. No equations, parameters, or uniqueness theorems are defined in terms of the target performance metrics, and no self-citation chain is used to justify the core mechanism. The reported token-level analysis and case studies are post-hoc diagnostics rather than inputs that the method is constructed to reproduce. The derivation chain is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Contrastive decoding selectively amplifies clinically relevant signals from the legacy model while preserving general-domain capabilities.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: $s(i) = \log p_{\text{new}}(i \mid x_{1:t}) + \alpha \left( \log p_{\text{old-clin}}(f(i) \mid x_{1:t}) - \log p_{\text{old}}(f(i) \mid x_{1:t}) \right)$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat ≃ Nat recovery
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: CAPT supports models with disjoint vocabularies, leveraging contrastive decoding
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.