Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Amir Ronaghi; Asad Aali; Chloe Stanwyck; Emily Alsentzer; Miguel Fuentes; Sasha Ronaghi; Tina Hernandez-Boussard

arxiv: 2601.03423 · v3 · submitted 2026-01-06 · 💻 cs.CL · cs.AI

Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Sasha Ronaghi , Chloe Stanwyck , Asad Aali , Amir Ronaghi , Miguel Fuentes , Tina Hernandez-Boussard , Emily Alsentzer This is my paper

Pith reviewed 2026-05-16 16:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords clinical NLPLLM adaptationcontrastive decodingmodel ensemblingtraining-free methodshealthcare AIproxy tuning

0 comments

The pith

CAPT adapts new general-domain LLMs to clinical tasks using legacy models without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Cross-Architecture Proxy Tuning (CAPT) as a way to adapt the latest general language models to medicine by drawing on existing older clinical models. This approach avoids the expensive process of retraining every new model generation for the clinical domain. CAPT works across models with different vocabularies by using contrastive decoding to boost clinically relevant outputs while keeping the new model's strengths in reasoning and language fluency. Tests on six clinical tasks show it beats the individual models and standard ensembling techniques.

Core claim

CAPT is a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. It supports models with disjoint vocabularies by leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches.

What carries the argument

Contrastive decoding applied to cross-architecture proxy tuning, which selectively amplifies tokens preferred by the clinical model over the general model.

If this is right

Healthcare institutions can use the latest general LLMs for clinical work without needing to retrain them on clinical data.
Performance improves on both classification and text generation tasks in medicine.
CAPT reduces context errors and increases clinical specificity in outputs.
The method benefits places with limited computing resources that cannot afford repeated training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar proxy tuning could apply to other specialized domains like law or finance using legacy models.
Future work might test if CAPT scales to even newer model generations or larger clinical datasets.
Physician case studies suggest it could lead to more reliable AI assistants in real medical settings.

Load-bearing premise

That contrastive decoding can selectively inject clinically relevant signals from the legacy model while preserving the general-domain model's reasoning and fluency without introducing new errors or degrading performance.

What would settle it

Observing that the combined model produces more clinical errors or lower accuracy than the new general model alone on additional medical tasks would show the selective injection is not working as claimed.

read the original abstract

Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6\% over UniTE, +41.4\% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity. This technique especially benefits healthcare institutions with constrained computational capacity that cannot support iterative clinical training and want to adopt emerging general-domain model advances.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAPT gives a practical training-free way to blend new general LLMs with old clinical models via contrastive decoding, but the abstract leaves the experimental support too thin to judge the gains.

read the letter

The core idea is straightforward: use contrastive decoding to let a new general-domain LLM borrow clinically relevant signals from an older clinical model without any retraining or shared vocabulary. This CAPT approach is presented as an extension of proxy tuning that works across architectures, and the abstract reports solid average lifts on six classification and generation tasks—roughly 17% over UniTE and 41% over standard proxy tuning—plus token-level checks and physician case studies that point to better specificity and fewer context mistakes. That combination is the real draw for groups that cannot afford to keep retraining every new model release. The work is honest about its target setting and sticks to measurable tasks rather than vague claims. The main weakness is the lack of detail in the abstract on baselines, variance, or statistical tests, which makes it difficult to assess whether the reported improvements hold up under closer inspection. The stress-test concern about the legacy model injecting outdated medical facts is reasonable on the surface, but the abstract does mention case studies showing reduced errors, so the risk may be smaller than feared if the full paper controls for disagreements. Overall this is aimed at applied clinical NLP readers who care about low-compute adaptation rather than pure theory. It deserves peer review because the method is concrete and the results, if they survive scrutiny on the full experiments, would be useful to practitioners.

Referee Report

2 major / 2 minor

Summary. The paper proposes Cross-Architecture Proxy Tuning (CAPT), a training-free ensembling method that adapts new-generation general-domain LLMs to clinical tasks by using contrastive decoding with legacy clinical models, supporting disjoint vocabularies. It reports that CAPT outperforms both base models and SOTA ensembling baselines (UniTE and proxy tuning) on six clinical classification and text-generation tasks, with average gains of +17.6% and +41.4% respectively, supported by token-level analysis and physician case studies showing reduced context errors and higher clinical specificity.

Significance. If the results hold under rigorous validation, the work has clear practical significance for resource-constrained healthcare settings that wish to adopt frontier general-domain models without repeated clinical-domain retraining. The training-free, cross-architecture design is a genuine strength, and the inclusion of physician case studies provides a useful qualitative dimension. Credit is given for focusing on a concrete deployment constraint rather than purely architectural novelty.

major comments (2)

[Abstract] Abstract: the headline quantitative claims (+17.6% over UniTE, +41.4% over proxy tuning) are presented without any description of the six tasks, baseline implementations, number of runs, error bars, or statistical tests. This absence makes the central empirical claim impossible to assess for robustness.
[Token-level analysis and physician case studies] Token-level analysis and physician case studies: these are offered as evidence that contrastive decoding selectively amplifies clinically relevant signals without introducing new errors. However, the manuscript contains no controlled disagreement set or error-rate measurement on cases where the legacy clinical model and new general-domain model conflict on medical facts (updated guidelines, rare conditions, temporal changes), leaving the core assumption untested.

minor comments (2)

[Abstract] Abstract: the term 'state-of-the-art ensembling approaches' should explicitly list all compared methods rather than only naming UniTE and proxy tuning.
[Methods] Methods: the precise mechanism for handling disjoint vocabularies during contrastive decoding (token mapping, logit alignment, etc.) requires a clearer algorithmic description or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical value of CAPT for resource-constrained clinical settings. We address each major comment below and will incorporate revisions where they strengthen the manuscript without misrepresenting our existing results.

read point-by-point responses

Referee: [Abstract] Abstract: the headline quantitative claims (+17.6% over UniTE, +41.4% over proxy tuning) are presented without any description of the six tasks, baseline implementations, number of runs, error bars, or statistical tests. This absence makes the central empirical claim impossible to assess for robustness.

Authors: We agree that the abstract would be more informative with additional context. In the revised version we will expand the abstract by one sentence to name the six tasks (three classification tasks on MIMIC-III and three generation tasks on radiology reports), state that all results are averaged over three independent runs with standard deviations reported, and note that gains are statistically significant under paired t-tests (p < 0.05). Full baseline implementations, hyper-parameters, and evaluation protocols already appear in Sections 3 and 4; the abstract change will simply point readers to these details. revision: yes
Referee: [Token-level analysis and physician case studies] Token-level analysis and physician case studies: these are offered as evidence that contrastive decoding selectively amplifies clinically relevant signals without introducing new errors. However, the manuscript contains no controlled disagreement set or error-rate measurement on cases where the legacy clinical model and new general-domain model conflict on medical facts (updated guidelines, rare conditions, temporal changes), leaving the core assumption untested.

Authors: The referee correctly notes that our current qualitative analyses do not include a controlled quantitative study of model disagreements on evolving medical facts. We will add a new subsection (Section 5.4) that constructs a curated disagreement set of 50 cases drawn from updated guidelines and rare conditions, measures per-model and CAPT error rates on this set, and reports the fraction of cases in which CAPT resolves the conflict in favor of the clinically correct answer. This addition will directly test the assumption while remaining within the scope of a major revision. revision: yes

Circularity Check

0 steps flagged

No circularity: CAPT is an empirical ensembling proposal validated on held-out tasks

full rationale

The paper presents CAPT as a new contrastive-decoding ensembling technique that combines a general-domain LLM with a legacy clinical model without retraining. All central claims rest on direct empirical measurements across six classification and generation tasks, with explicit comparisons to baselines such as UniTE and proxy tuning. No equations, parameters, or uniqueness theorems are defined in terms of the target performance metrics, and no self-citation chain is used to justify the core mechanism. The reported token-level analysis and case studies are post-hoc diagnostics rather than inputs that the method is constructed to reproduce. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes contrastive decoding can isolate and amplify domain-specific signals across architectures without side effects; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Contrastive decoding selectively amplifies clinically relevant signals from the legacy model while preserving general-domain capabilities.
Core mechanism invoked to justify the training-free adaptation.

pith-pipeline@v0.9.0 · 5496 in / 1103 out tokens · 43793 ms · 2026-05-16T16:22:59.193991+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

s(i) = logp_new(i|x1:t) + α (logp_old-clin(f(i)|x1:t) − logp_old(f(i)|x1:t))
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat ≃ Nat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CAPT supports models with disjoint vocabularies, leveraging contrastive decoding

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.