pith. machine review for the scientific record.

arxiv: 2604.19394 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords continual pre-training · medical domain adaptation · German medical LLMs · model merging · specialized language models · non-English medical data · FineMed-de corpus · instruction-following

The pith

Continual pre-training on a German medical corpus lets 7B models achieve a roughly 3.5-fold win-rate gain against a much larger general-purpose model on German medical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that domain adaptation through continual pre-training and model merging can close much of the performance gap between small specialized language models and far larger general ones in the medical domain. It constructs a German medical corpus called FineMed-de to overcome the lack of high-quality non-English medical data, then applies it to adapt models ranging from 7B to 24B parameters. Evaluations on German medical benchmarks show that the resulting specialized 7B models dramatically outperform their base versions and even compete favorably against larger general models in instruction-following. The work also notes that while merging helps recover instruction abilities, it creates trade-offs such as language mixing and verbosity. This points to a practical path for building efficient, domain-specific models for non-English medical applications without requiring massive scale.
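
To make the pipeline concrete, here is a minimal sketch of the continual pre-training stage, assuming a Hugging Face-style training stack; the corpus file name, base model choice, and hyperparameters are illustrative stand-ins, not the paper's reported configuration.

```python
# Minimal continual pre-training sketch (illustrative, not the authors' setup).
# Assumes a JSONL corpus with a "text" field; the file name is hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-7B"  # one of the base-model families named in the abstract
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

corpus = load_dataset("json", data_files="finemed_de.jsonl", split="train")
corpus = corpus.map(
    lambda ex: tok(ex["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="definemed-cpt", num_train_epochs=2,
                           per_device_train_batch_size=1, bf16=True),
    train_dataset=corpus,
    # mlm=False gives the standard causal (next-token) objective
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False))
trainer.train()
```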

Core claim

By continually pre-training and merging three LLMs on the FineMed-de corpus, the resulting DeFineMed family shows that specialization dramatically improves 7B model performance on German medical benchmarks, including an approximately 3.5-fold increase in pairwise win-rate against the much larger Mistral-Small-24B-Instruct, positioning specialized small models as competitive and resource-efficient for complex medical tasks.

What carries the argument

Continual pre-training on the FineMed-de corpus followed by model merging to adapt general LLMs to the German medical domain.
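
A minimal sketch of what such a merge can look like, using plain linear interpolation of weights between the domain-adapted checkpoint and its instruction-tuned sibling; the paper's actual merge recipe and mixing weight are not reproduced here, and toolkits such as MergeKit implement more sophisticated variants (e.g., SLERP).

```python
# Sketch of the merging step: linear interpolation between the continually
# pre-trained checkpoint and an instruction-tuned counterpart. The 0.5 mixing
# weight is illustrative; "definemed-cpt" is a hypothetical local path.
import torch
from transformers import AutoModelForCausalLM

cpt = AutoModelForCausalLM.from_pretrained("definemed-cpt")  # domain-adapted
inst = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

alpha = 0.5  # share of the domain-adapted weights
with torch.no_grad():
    merged = cpt.state_dict()
    for name, w_inst in inst.state_dict().items():
        merged[name] = alpha * merged[name] + (1.0 - alpha) * w_inst
cpt.load_state_dict(merged)
cpt.save_pretrained("definemed-merged")
```

The intuition is that the interpolated weights retain domain knowledge from continual pre-training while pulling instruction-following behavior back from the instruct checkpoint; SLERP-style merges replace the linear blend with spherical interpolation.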

If this is right

  • Specialized 7B models become viable alternatives to much larger general models for German medical instruction-following.
  • Model merging can restore instruction-following capabilities lost during continual pre-training.
  • Domain adaptation introduces measurable trade-offs including language mixing and increased verbosity in outputs.
  • The approach offers a compliant methodology for creating specialized medical LLMs in non-English settings.
  • Further targeted fine-tuning is needed to mitigate the observed failure modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar corpus-construction and adaptation pipelines could be applied to other low-resource languages or medical sub-specialties where general models underperform.
  • Resource savings from using 7B models instead of 24B+ ones could enable wider deployment in privacy-sensitive healthcare environments.
  • The observed trade-offs suggest that merging alone may not be optimal and could be combined with other alignment techniques for better balance.

Load-bearing premise

The FineMed-de corpus is high-quality, representative of real German medical knowledge, and free of harmful biases or noise that would degrade model behavior.
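
That premise is testable at the filtering stage. Below is a minimal sketch of the kind of keyword-density relevance filter a corpus pipeline over FineWeb2 documents might use; the cue-word list and threshold are invented for illustration and are not the FineMed-de criteria.

```python
# Toy domain-relevance filter over web documents. Cue words and threshold
# are invented for this sketch; they are not the paper's filtering criteria.
import re

MEDICAL_CUES = {"patient", "diagnose", "therapie", "symptom", "behandlung",
                "erkrankung", "klinik", "arzneimittel"}

def domain_score(text: str) -> float:
    """Fraction of word tokens that are medical cue words."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return sum(t in MEDICAL_CUES for t in tokens) / len(tokens) if tokens else 0.0

def keep(doc: dict, threshold: float = 0.01) -> bool:
    return domain_score(doc["text"]) >= threshold

# e.g. fineweb_de.filter(keep) on a datasets.Dataset of German web pages
```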

What would settle it

An evaluation on a held-out set of authentic German-language medical queries and cases in which the adapted 7B models showed no win-rate gain, or made more errors, than the base general models would falsify the central performance claim.
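
Such a test is cheap to score once pairwise judgments exist. A minimal sketch of the win-rate computation with a percentile-bootstrap confidence interval, on hypothetical judgment data:

```python
# Win-rate with a bootstrap confidence interval over hypothetical
# head-to-head judgments (1 = adapted 7B wins, 0 = larger baseline wins).
import random

def win_rate_ci(judgments, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    n = len(judgments)
    point = sum(judgments) / n
    boots = sorted(sum(rng.choices(judgments, k=n)) / n for _ in range(n_boot))
    return point, (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])

judgments = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # toy data only
rate, (lo, hi) = win_rate_ci(judgments)
print(f"win-rate {rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

On authentic held-out queries, an interval whose lower bound sits at or below 0.5 would undercut the claimed advantage over the larger baseline.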

Figures

Figures reproduced from arXiv: 2604.19394 by Hammam Abdelwahab, Héctor Allende-Cid, Jasper Schulze Buschhoff, Katrin Klug, Niclas Doll, Shalaka Satheesh.

Figure 1
High-level illustration of the Data Filtering and Model Adaptation workflow: a subset of the German …
Figure 2
Matrix of model win-rates on the German MedAlpaca dataset, where each value represents the win-rate of the row model over the column model in a head-to-head comparison.
Figure 3
Frequency count of distinct failure modes for base instruction-tuned and merged models, quantified using …
Figure 4
Performance metrics of the medical document …
Figure 5
Weak scaling behavior on Karolina and Leonardo. The actual computation per accelerator is kept constant throughout, with a micro batch size of 1.
read the original abstract

This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from 7B to 24B parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances 7B model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately 3.5-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized 7B models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that continual pre-training on a newly constructed German medical corpus (FineMed-de) extracted from FineWeb2, combined with model merging, can bridge the performance gap between small specialized LLMs and larger general-purpose models. Specifically, it reports that this approach dramatically improves 7B model performance on German medical benchmarks and yields an approximately 3.5-fold increase in win-rate for Qwen2.5-based models against the much larger Mistral-Small-24B-Instruct, while model merging helps restore instruction-following abilities despite trade-offs such as language mixing and verbosity.

Significance. If substantiated with rigorous validation, the results would be significant for domain-adapted language models, especially in non-English medical applications. They indicate that targeted continual pre-training on domain data can enable resource-efficient 7B models to compete with much larger general models, with practical implications for German-speaking healthcare. The work also contributes a methodology for addressing data scarcity in specialized domains and examines the benefits and limitations of post-adaptation model merging.

major comments (3)
  1. [Abstract] The abstract states clear performance lifts and a 3.5x win-rate improvement without providing details on the exact benchmarks used, statistical tests performed, baseline comparisons, or error bars. This makes it difficult to evaluate the strength of the evidence supporting the central claim of bridging the performance gap through specialization.
  2. [Corpus Construction] The FineMed-de corpus is presented as high-quality and representative, yet the manuscript provides no information on medical expert validation, quantitative filtering metrics for relevance, or decontamination procedures against the evaluation benchmarks. Since the gains are attributed to domain adaptation on this corpus, the absence of these checks is a load-bearing concern that could indicate the improvements stem from data artifacts rather than genuine medical knowledge acquisition.
  3. [Evaluation and Failure Mode Analysis] The pairwise win-rate analysis and failure mode discussion highlight trade-offs such as language mixing and increased verbosity after merging. However, it is not clear from the reported results how these issues quantitatively affect performance on medical instruction-following tasks or whether they undermine the claimed competitiveness of the specialized 7B models.
minor comments (2)
  1. [Introduction] Ensure consistent use of model names (e.g., Qwen2.5-based models) and provide references to the original papers for the base LLMs used.
  2. [Evaluation] The manuscript would benefit from including error bars or confidence intervals in any reported performance metrics to allow assessment of result stability.
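
Major comment 2's decontamination concern is mechanically checkable. A minimal sketch of an n-gram overlap screen between training documents and benchmark items; the 13-gram window follows common decontamination practice, not anything the paper specifies.

```python
# N-gram overlap screen between training documents and benchmark items.
# The 13-gram window is a common convention, not the paper's stated procedure.
def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_benchmark_index(items: list[str], n: int = 13) -> set:
    index = set()
    for item in items:
        index |= ngrams(item, n)
    return index

def contaminated(doc: str, index: set, n: int = 13) -> bool:
    """True if the document shares any n-gram with a benchmark item."""
    return not ngrams(doc, n).isdisjoint(index)

# Build the index once, then screen every corpus document before training.
```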

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have made revisions to improve the clarity and rigor of the paper.

read point-by-point responses
  1. Referee: [Abstract] The abstract states clear performance lifts and a 3.5x win-rate improvement without providing details on the exact benchmarks used, statistical tests performed, baseline comparisons, or error bars. This makes it difficult to evaluate the strength of the evidence supporting the central claim of bridging the performance gap through specialization.

    Authors: We agree that the abstract could be strengthened by including more specific details. In the revised manuscript, we will update the abstract to explicitly name the German medical benchmarks used, reference the statistical methods for win-rate calculations (including any significance testing), clarify the baseline models (such as Mistral-Small-24B-Instruct), and point to the error bars or variance reported in the main results sections. This will provide readers with a clearer view of the evidence without exceeding abstract length constraints. revision: yes

  2. Referee: [Corpus Construction] The FineMed-de corpus is presented as high-quality and representative, yet the manuscript provides no information on medical expert validation, quantitative filtering metrics for relevance, or decontamination procedures against the evaluation benchmarks. Since the gains are attributed to domain adaptation on this corpus, the absence of these checks is a load-bearing concern that could indicate the improvements stem from data artifacts rather than genuine medical knowledge acquisition.

    Authors: We acknowledge this valid concern regarding the corpus construction details. The original manuscript describes the extraction from FineWeb2 but lacks sufficient specifics on validation. In the revision, we will add quantitative details on the filtering metrics employed (e.g., perplexity thresholds, domain relevance scores via keyword matching or embeddings), and explicit decontamination steps to remove any overlap with evaluation benchmarks. Regarding medical expert validation, we did not conduct a formal review by domain experts due to practical limitations; we will explicitly state this as a limitation and describe the automated and heuristic-based quality assurance methods used instead. This addresses the potential for data artifacts. revision: partial

  3. Referee: [Evaluation and Failure Mode Analysis] The pairwise win-rate analysis and failure mode discussion highlight trade-offs such as language mixing and increased verbosity after merging. However, it is not clear from the reported results how these issues quantitatively affect performance on medical instruction-following tasks or whether they undermine the claimed competitiveness of the specialized 7B models.

    Authors: We thank the referee for pointing out the need for more quantitative analysis of the failure modes. The manuscript includes a qualitative discussion of language mixing and verbosity. For the revised version, we will incorporate quantitative assessments, such as the proportion of outputs flagged for language mixing using automated language identification tools, comparisons of response lengths (verbosity metrics), and their correlation with task performance scores on the medical benchmarks. This will help evaluate the impact on the overall competitiveness of the 7B models and inform the discussion on trade-offs. revision: yes
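
The quantification promised in response 3 is easy to prototype. A minimal sketch of a language-mixing rate and a verbosity metric over model outputs, using the langdetect package with deliberately naive sentence splitting; none of this is the authors' tooling.

```python
# Two failure-mode metrics from the rebuttal: share of responses that mix
# languages, and mean response length in whitespace tokens.
from langdetect import detect  # pip install langdetect

def mixes_languages(response: str) -> bool:
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    langs = set()
    for s in sentences:
        try:
            langs.add(detect(s))
        except Exception:  # langdetect raises on very short/ambiguous input
            continue
    return "de" in langs and bool(langs - {"de"})

def failure_mode_report(responses: list[str]) -> dict:
    return {
        "language_mixing_rate": sum(map(mixes_languages, responses)) / len(responses),
        "mean_tokens": sum(len(r.split()) for r in responses) / len(responses),
    }
```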

Circularity Check

0 steps flagged

No circularity: purely empirical pipeline with no derivations or self-referential reductions

full rationale

The paper constructs FineMed-de from FineWeb2, performs continual pre-training and merging on existing LLMs, then reports benchmark scores and win-rates. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes are invoked. All headline results (7B specialization gains, 3.5x win-rate) are direct empirical measurements, not reductions to inputs by construction. Self-citations, if present, are not load-bearing for any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on standard assumptions of LLM continual pre-training and merging.

pith-pipeline@v0.9.0 · 5553 in / 1171 out tokens · 31079 ms · 2026-05-10T02:25:23.373876+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 25 canonical work pages · 1 internal anchor
