Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Benoit Favre; Carlos Ramisch; Frederic Bechet; Ikram Belmadani; Oumaima El Khettari; Richard Dufour

arxiv: 2606.19266 · v1 · pith:SF7Z7QCVnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Ikram Belmadani , Oumaima El Khettari , Carlos Ramisch , Frederic Bechet , Richard Dufour , Benoit Favre This is my paper

Pith reviewed 2026-06-26 20:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords medical LLM adaptationFrench QAcontinual pretrainingsupervised fine-tuningMCQAOEQAdomain adaptationLLM evaluation

0 comments

The pith

For French medical multiple-choice QA, supervised fine-tuning alone matches or nearly matches the gains from adding continual pretraining, while open-ended QA shows different trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares three adaptation strategies—continual pretraining, supervised fine-tuning, and their combination—on French medical question-answering tasks using multiple model families and sizes. It evaluates performance separately on multiple-choice questions and open-ended questions with both automatic metrics and LLM-based judging. The central finding is that for multiple-choice questions the combined approach wins most often but the extra benefit over fine-tuning alone is small and often statistically insignificant, while for open-ended questions continual pretraining helps overlap metrics yet fine-tuning tends to lower generation quality. Cross-lingual tests show that French-adapted models also improve on English medical benchmarks. These results are meant to guide practical choices when compute is limited.

Core claim

Across three model families, CPT+SFT most often records the highest scores on multiple-choice medical QA, yet the margin over SFT alone is small and frequently fails to reach statistical significance; therefore SFT constitutes a strong, lower-cost default. On open-ended QA, CPT reliably raises overlap-based automatic metrics, whereas SFT tends to reduce generation quality; instruction tuning and the combined CPT+SFT route receive higher preference under LLM-as-a-Judge evaluation. Adaptation performed on French data transfers positively to English medical benchmarks.

What carries the argument

Head-to-head comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and CPT+SFT on French medical MCQA and OEQA datasets, disentangling effects from base model choice and initialization.

If this is right

Under compute constraints, practitioners can default to SFT for multiple-choice medical QA without expecting large further gains from CPT.
For open-ended generation, CPT should be retained if overlap metrics matter, while SFT may need to be skipped or paired with instruction tuning.
French-domain adaptation produces measurable positive transfer to English medical QA benchmarks.
Model size and family interact with adaptation strategy, so results should be checked per architecture rather than assumed universal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trade-off pattern may appear in other low-resource languages when medical corpora are similarly scarce.
If clinical deployment values factual correctness over surface overlap, the preference for instruction-tuned or combined routes on OEQA would matter more than the automatic scores suggest.
Future work could test whether the small CPT+SFT margin persists when the base models are already instruction-tuned on general medical English data.

Load-bearing premise

The selected French medical QA datasets together with the automatic metrics and LLM-as-a-Judge protocol give unbiased, representative measures of adaptation quality and clinical utility.

What would settle it

A human expert evaluation of the same model outputs on a held-out clinical French QA set that contradicts the ranking produced by the LLM judge or by the overlap metrics.

Figures

Figures reproduced from arXiv: 2606.19266 by Benoit Favre, Carlos Ramisch, Frederic Bechet, Ikram Belmadani, Oumaima El Khettari, Richard Dufour.

**Figure 2.** Figure 2: Difference in EM accuracy (∆EM) between native English MCQU test benchmarks and their French translations across model families and adaptation strategies (constrained decoding). reports averaged results across datasets using constrained decoding; full results for both greedy and constrained decoding are provided in Appendix M. For the Mistral family, base models consistently perform better on the translat… view at source ↗

**Figure 3.** Figure 3: Relationship between accuracy gain (∆EM) and change in predictive entropy (∆Entropy) when moving from the translated to native benchmarks. Each point corresponds to a model configuration. benchmark, MedMCQA (Pal et al., 2022), using accuracy and confidence-based metrics. Both benchmarks consist of MCQUs of comparable size, for fair comparison. Although instances are not shared, consistent differences acr… view at source ↗

**Figure 4.** Figure 4: Probability-level metrics for MCQ and MCQU [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Relationship between accuracy gain (∆EM) and change in confidence on incorrect predictions (∆pmax,wrong) between the translated and native benchmarks. Positive values of ∆pmax,wrong indicate increased confidence on incorrect predictions. O Computational Resources and Environmental Impact [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

read the original abstract

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SFT alone is often enough for French medical MCQA while CPT adds limited value, with more mixed results for open-ended QA in this systematic but narrow empirical study.

read the letter

The main point is that for French medical multiple-choice QA, supervised fine-tuning gets most of the gains and adding continual pretraining on top rarely moves the needle enough to matter, while open-ended QA shows CPT helping overlap metrics but SFT sometimes hurting generation quality under the LLM judge.

The paper runs the same set of adaptations across three model families, multiple sizes, and different starting points. That disentangling is the useful part because it separates strategy effects from base model choice. They cover both MCQA and OEQA, test greedy and constrained decoding, report automatic metrics plus LLM-as-a-Judge, and include a cross-lingual check to English benchmarks. The hedging on small or insignificant gains for MCQA is straightforward and avoids overclaiming.

What it does well is stay empirical and comparative at a decent scale. The preference ordering under the judge for OEQA and the note that SFT is cost-effective are the kind of practical observations that can save people compute when adapting to a new language and domain.

The soft spots are standard for this type of work. LLM-as-a-Judge is known to be imperfect for medical content, the tasks are limited to QA, and everything is French-specific so the guidelines may not travel far. The abstract flags statistical tests but leaves dataset sizes, exact methods, and error bars unclear, which makes it harder to assess robustness without the full text. No load-bearing flaws show up in the reported claims, just the usual limits on how far the results generalize.

This is for groups doing medical LLM work in non-English settings who need quick rules of thumb on adaptation order. A reader focused on compute-efficient domain adaptation would get some usable signals. It deserves a serious referee because the comparisons are broad enough to check and the tempered conclusions make the paper worth the time even if the core pattern is not surprising.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical study of medical domain adaptation for French QA, comparing continual pretraining (CPT), supervised fine-tuning (SFT), and CPT+SFT across three model families, multiple sizes, and initialization types. It disentangles adaptation effects from base model choice and evaluates MCQA and OEQA using automatic metrics plus LLM-as-a-Judge under greedy and constrained decoding, plus cross-lingual transfer to English. Key findings are that CPT+SFT most often wins on MCQA but gains over SFT are small and frequently insignificant (making SFT a cost-effective default), while CPT improves overlap metrics on OEQA but SFT can degrade generation quality (with instruction tuning and CPT+SFT preferred by LLM judges).

Significance. If the comparative results hold, the work supplies actionable guidelines for choosing adaptation strategies under compute constraints in specialized domains. Strengths include the explicit disentanglement of adaptation from base-model effects, the hedging on statistical significance of small gains, the distinction between metric types (overlap vs. LLM judge), and the cross-lingual transfer experiments; these elements make the ablation more informative than typical single-model studies.

major comments (2)

[Methods / Results] Methods section: the abstract and results claim statistical significance tests and cross-lingual transfer, yet no details appear on dataset sizes, exclusion criteria, exact tests (e.g., paired t-test or bootstrap), multiple-comparison correction, or error bars; this information is load-bearing for the central claim that CPT+SFT gains over SFT are “frequently not statistically significant.”
[Evaluation / OEQA results] Evaluation protocol (OEQA subsection): reliance on LLM-as-a-Judge without a reported human validation subset or inter-annotator agreement on a medical sample risks systematic bias, especially since the paper notes imperfect correlation with human judgment; this directly affects the claim that instruction tuning and CPT+SFT are preferred.

minor comments (2)

[Tables] Table captions and axis labels should explicitly state the number of runs or seeds used for each reported score to allow readers to assess variability.
[Cross-lingual experiments] The cross-lingual transfer paragraph would benefit from a short statement on whether the English benchmarks were seen during any pretraining stage of the base models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript's significance and for the constructive feedback. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Methods / Results] Methods section: the abstract and results claim statistical significance tests and cross-lingual transfer, yet no details appear on dataset sizes, exclusion criteria, exact tests (e.g., paired t-test or bootstrap), multiple-comparison correction, or error bars; this information is load-bearing for the central claim that CPT+SFT gains over SFT are “frequently not statistically significant.”

Authors: We agree that the statistical analysis details were insufficiently reported. The revised manuscript will add the missing information, including dataset sizes for each experiment and language, exclusion criteria for QA samples, the precise tests used (paired t-tests with bootstrap resampling), multiple-comparison correction (Bonferroni), and error bars or confidence intervals on all relevant figures and tables. These additions will directly support the claims about the (non-)significance of CPT+SFT gains over SFT alone. revision: yes
Referee: [Evaluation / OEQA results] Evaluation protocol (OEQA subsection): reliance on LLM-as-a-Judge without a reported human validation subset or inter-annotator agreement on a medical sample risks systematic bias, especially since the paper notes imperfect correlation with human judgment; this directly affects the claim that instruction tuning and CPT+SFT are preferred.

Authors: We acknowledge the risk of systematic bias when relying solely on LLM-as-a-Judge for medical content. The original text already notes the imperfect correlation with human judgment. In revision we will add a human validation study on a representative subset of the OEQA outputs, reporting inter-annotator agreement, to strengthen the preference claims for instruction tuning and CPT+SFT. If the added study confirms the LLM-judge rankings, we will state this explicitly; otherwise we will qualify the claims accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study

full rationale

The paper is an empirical comparison of CPT, SFT, and CPT+SFT across model families, sizes, and initialization types on French medical QA datasets. All reported results (MCQA/OEQA scores, statistical significance, LLM-as-Judge preferences, cross-lingual transfer) are direct experimental outcomes using external benchmarks and standard metrics. No derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described content. The disentanglement of adaptation effects is achieved through controlled ablations rather than any definitional or self-referential reduction. This is a standard, self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on standard LLM adaptation techniques and evaluation practices already established in prior literature; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption Overlap-based automatic metrics and LLM-as-a-Judge are valid proxies for medical QA quality.
Invoked when reporting OEQA results and LLM-based preferences.

pith-pipeline@v0.9.1-grok · 5752 in / 1282 out tokens · 29515 ms · 2026-06-26T20:58:31.742830+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Aryo Gema, Pasquale Minervini, Luke Daines, Tom Hope, and Beatrice Alex

Medical mt5: An open-source multilingual text-to-text llm for the medical domain.Preprint, arXiv:2404.07613. Aryo Gema, Pasquale Minervini, Luke Daines, Tom Hope, and Beatrice Alex. 2024. Parameter-efficient fine-tuning of LLaMA for the clinical domain. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 91–104, Mexico City, Mex...

work page arXiv 2024
[2]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.Preprint, arXiv:2009.03300. Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations. Yining Huang, Keke Tang, Meilian Chen, and Boyuan W...

work page internal anchor Pith review Pith/arXiv arXiv 2009
[3]

InProceedings of the 23rd Workshop on Biomedi- cal Natural Language Processing, pages 277–294, Bangkok, Thailand

K-QA: A real-world medical Q&A benchmark. InProceedings of the 23rd Workshop on Biomedi- cal Natural Language Processing, pages 277–294, Bangkok, Thailand. Association for Computational Linguistics. Yulong Mao, Kaiyu Huang, Changhao Guan, Ganglin Bao, Fengran Mo, and Jinan Xu. 2024. DoRA: En- hancing parameter-efficient fine-tuning with dynamic rank distr...

work page arXiv 2024
[4]

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

JudgeLM: Fine-tuned large language models are scalable judges. InThe Thirteenth International Conference on Learning Representations. Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Text collection through web scraping, raw tex- tual sources, and optical character recognition (OCR)
[6]

Sentence splitting using heuristic methods
[7]

Aggressive filtering to remove short or low- quality sentences
[8]

The dataset comprises 571 436 examples spanning MCQUs, MCQs, and OEQAs

Language classification using a custom classi- fier trained on multilingual corpora B SFT Training Corpus : MedInjection-Fr Description B.1 Overview MedInjection-FR (Belmadani et al., 2026b) is a large-scale French biomedical instruction dataset composed of native, translated, and synthetic instruction–response pairs. The dataset comprises 571 436 example...

work page arXiv 2024
[9]

an instruction describing the task,
[10]

the question (and answer options when appli- cable),
[11]

an optional context section, and
[12]

suspicious

a response header indicating where the model output should begin. Task-Specific Constraints.The only variation across task types lies in the expected response for- mat, which is explicitly stated in the instruction. Table 6 summarizes the templates used for each task. Canonical Prompt Format.The following ab- stract template illustrates the prompt structu...

work page arXiv 2091

[1] [1]

Aryo Gema, Pasquale Minervini, Luke Daines, Tom Hope, and Beatrice Alex

Medical mt5: An open-source multilingual text-to-text llm for the medical domain.Preprint, arXiv:2404.07613. Aryo Gema, Pasquale Minervini, Luke Daines, Tom Hope, and Beatrice Alex. 2024. Parameter-efficient fine-tuning of LLaMA for the clinical domain. In Proceedings of the 6th Clinical Natural Language Processing Workshop, pages 91–104, Mexico City, Mex...

work page arXiv 2024

[2] [2]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language under- standing.Preprint, arXiv:2009.03300. Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations. Yining Huang, Keke Tang, Meilian Chen, and Boyuan W...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[3] [3]

InProceedings of the 23rd Workshop on Biomedi- cal Natural Language Processing, pages 277–294, Bangkok, Thailand

K-QA: A real-world medical Q&A benchmark. InProceedings of the 23rd Workshop on Biomedi- cal Natural Language Processing, pages 277–294, Bangkok, Thailand. Association for Computational Linguistics. Yulong Mao, Kaiyu Huang, Changhao Guan, Ganglin Bao, Fengran Mo, and Jinan Xu. 2024. DoRA: En- hancing parameter-efficient fine-tuning with dynamic rank distr...

work page arXiv 2024

[4] [4]

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

JudgeLM: Fine-tuned large language models are scalable judges. InThe Thirteenth International Conference on Learning Representations. Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Text collection through web scraping, raw tex- tual sources, and optical character recognition (OCR)

[6] [6]

Sentence splitting using heuristic methods

[7] [7]

Aggressive filtering to remove short or low- quality sentences

[8] [8]

The dataset comprises 571 436 examples spanning MCQUs, MCQs, and OEQAs

Language classification using a custom classi- fier trained on multilingual corpora B SFT Training Corpus : MedInjection-Fr Description B.1 Overview MedInjection-FR (Belmadani et al., 2026b) is a large-scale French biomedical instruction dataset composed of native, translated, and synthetic instruction–response pairs. The dataset comprises 571 436 example...

work page arXiv 2024

[9] [9]

an instruction describing the task,

[10] [10]

the question (and answer options when appli- cable),

[11] [11]

an optional context section, and

[12] [12]

suspicious

a response header indicating where the model output should begin. Task-Specific Constraints.The only variation across task types lies in the expected response for- mat, which is explicitly stated in the instruction. Table 6 summarizes the templates used for each task. Canonical Prompt Format.The following ab- stract template illustrates the prompt structu...

work page arXiv 2091