Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
Pith reviewed 2026-05-10 02:25 UTC · model grok-4.3
The pith
Continual pre-training on a German medical corpus, followed by model merging, gives 7B models an approximately 3.5-fold increase in pairwise win-rate against a much larger 24B general-purpose model on German medical tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By continually pre-training three LLMs on the FineMed-de corpus and merging the results, the DeFineMed family shows that specialization dramatically improves 7B model performance on German medical benchmarks. Notably, the Qwen2.5-based model achieves an approximately 3.5-fold increase in pairwise win-rate against the 24B-parameter Mistral-Small-24B-Instruct, positioning specialized small models as competitive, resource-efficient options for complex medical tasks.
What carries the argument
Continual pre-training on the FineMed-de corpus followed by model merging to adapt general LLMs to the German medical domain.
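The merging step can be pictured with a minimal sketch. The paper's exact merging recipe is not specified in this summary, so the following assumes simple linear weight interpolation (one of the basic modes offered by merging toolkits such as MergeKit); the parameter names and values are toy stand-ins.

```python
def linear_merge(base, adapted, alpha=0.5):
    """Linearly interpolate two checkpoints with matching parameter names.

    alpha=0.0 returns the base (general) model, alpha=1.0 the
    domain-adapted one; intermediate values trade off general
    instruction following against medical specialization.
    """
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [(1 - alpha) * b + alpha * a
               for b, a in zip(base[name], adapted[name])]
        for name in base
    }

# Toy 'checkpoints': parameter name -> flat weight list.
general = {"mlp.w": [1.0, 2.0], "attn.w": [0.0, 4.0]}
medical = {"mlp.w": [3.0, 2.0], "attn.w": [2.0, 0.0]}

merged = linear_merge(general, medical, alpha=0.5)
print(merged["mlp.w"])  # → [2.0, 2.0], the midpoint of the two checkpoints
```

Real merges operate on full tensors (and sometimes spherical rather than linear interpolation), but the trade-off knob works the same way: alpha controls how much domain specialization the merged model retains.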
If this is right
- Specialized 7B models become viable alternatives to much larger general models for German medical instruction-following.
- Model merging can restore instruction-following capabilities lost during continual pre-training.
- Domain adaptation introduces measurable trade-offs including language mixing and increased verbosity in outputs.
- The approach offers a compliant methodology for creating specialized medical LLMs in non-English settings.
- Further targeted fine-tuning is needed to mitigate the observed failure modes.
Where Pith is reading between the lines
- Similar corpus-construction and adaptation pipelines could be applied to other low-resource languages or medical sub-specialties where general models underperform.
- Resource savings from using 7B models instead of 24B+ ones could enable wider deployment in privacy-sensitive healthcare environments.
- The observed trade-offs suggest that merging alone may not be optimal and could be combined with other alignment techniques for better balance.
Load-bearing premise
The FineMed-de corpus is high-quality, representative of real German medical knowledge, and free of harmful biases or noise that would degrade model behavior.
What would settle it
A fresh evaluation on a held-out set of authentic German-language medical queries and cases would settle it: if the adapted 7B models showed no win-rate gain over the base general models, or made more errors than they do, the central performance claim would be falsified.
Original abstract
This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from 7B to 24B parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances 7B model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately 3.5-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized 7B models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that continual pre-training on a newly constructed German medical corpus (FineMed-de), extracted from FineWeb2 and combined with model merging, can bridge the performance gap between small specialized LLMs and larger general-purpose models. Specifically, it reports that this approach dramatically improves 7B model performance on German medical benchmarks and yields an approximately 3.5-fold increase in win-rate for Qwen2.5-based models against the 24B-parameter Mistral-Small-24B-Instruct, while model merging helps restore instruction-following abilities despite trade-offs such as language mixing and verbosity.
Significance. If substantiated with rigorous validation, the results would be significant for domain-adapted language models, especially in non-English medical applications. They indicate that targeted continual pre-training on domain data can enable resource-efficient 7B models to compete with much larger general models, with practical implications for German-speaking healthcare. The work also contributes a methodology for addressing data scarcity in specialized domains and examines the benefits and limitations of post-adaptation model merging.
major comments (3)
- [Abstract] The abstract states clear performance lifts and a 3.5x win-rate improvement without providing details on the exact benchmarks used, statistical tests performed, baseline comparisons, or error bars. This makes it difficult to evaluate the strength of the evidence supporting the central claim of bridging the performance gap through specialization.
- [Corpus Construction] The FineMed-de corpus is presented as high-quality and representative, yet the manuscript provides no information on medical expert validation, quantitative filtering metrics for relevance, or decontamination procedures against the evaluation benchmarks. Since the gains are attributed to domain adaptation on this corpus, the absence of these checks is a load-bearing concern that could indicate the improvements stem from data artifacts rather than genuine medical knowledge acquisition.
- [Evaluation and Failure Mode Analysis] The pairwise win-rate analysis and failure mode discussion highlight trade-offs such as language mixing and increased verbosity after merging. However, it is not clear from the reported results how these issues quantitatively affect performance on medical instruction-following tasks or whether they undermine the claimed competitiveness of the specialized 7B models.
minor comments (2)
- [Introduction] Ensure consistent use of model names (e.g., Qwen2.5-based models) and provide references to the original papers for the base LLMs used.
- [Evaluation] The manuscript would benefit from including error bars or confidence intervals in any reported performance metrics to allow assessment of result stability.
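One way to supply the intervals the referee asks for is a percentile bootstrap over per-prompt judgments. This is a sketch under the assumption that the pairwise win-rate analysis scores each prompt as a win, tie, or loss for the specialized model; the counts below are invented for illustration.

```python
import random

def winrate_ci(outcomes, n_boot=2000, seed=0):
    """Bootstrap a 95% confidence interval for a pairwise win-rate.

    `outcomes` holds one value per judged prompt: 1.0 for a win by
    the specialized model, 0.0 for a loss, 0.5 for a tie.
    """
    rng = random.Random(seed)
    point = sum(outcomes) / len(outcomes)
    stats = []
    for _ in range(n_boot):
        # Resample prompts with replacement and recompute the win-rate.
        sample = [rng.choice(outcomes) for _ in outcomes]
        stats.append(sum(sample) / len(sample))
    stats.sort()
    lo = stats[int(0.025 * n_boot)]
    hi = stats[int(0.975 * n_boot)]
    return point, lo, hi

# Hypothetical judgments: 60 wins, 10 ties, 30 losses over 100 prompts.
outcomes = [1.0] * 60 + [0.5] * 10 + [0.0] * 30
point, lo, hi = winrate_ci(outcomes)
print(f"win-rate {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the 3.5x headline number would let readers judge whether the gap to the 24B baseline is stable or within resampling noise.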
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have made revisions to improve the clarity and rigor of the paper.
Point-by-point responses
-
Referee: [Abstract] The abstract states clear performance lifts and a 3.5x win-rate improvement without providing details on the exact benchmarks used, statistical tests performed, baseline comparisons, or error bars. This makes it difficult to evaluate the strength of the evidence supporting the central claim of bridging the performance gap through specialization.
Authors: We agree that the abstract could be strengthened by including more specific details. In the revised manuscript, we will update the abstract to explicitly name the German medical benchmarks used, reference the statistical methods for win-rate calculations (including any significance testing), clarify the baseline models (such as the 24B Mistral-Small-24B-Instruct), and point to the error bars or variance reported in the main results sections. This will provide readers with a clearer view of the evidence without exceeding abstract length constraints. revision: yes
-
Referee: [Corpus Construction] The FineMed-de corpus is presented as high-quality and representative, yet the manuscript provides no information on medical expert validation, quantitative filtering metrics for relevance, or decontamination procedures against the evaluation benchmarks. Since the gains are attributed to domain adaptation on this corpus, the absence of these checks is a load-bearing concern that could indicate the improvements stem from data artifacts rather than genuine medical knowledge acquisition.
Authors: We acknowledge this valid concern regarding the corpus construction details. The original manuscript describes the extraction from FineWeb2 but lacks sufficient specifics on validation. In the revision, we will add quantitative details on the filtering metrics employed (e.g., perplexity thresholds, domain relevance scores via keyword matching or embeddings), and explicit decontamination steps to remove any overlap with evaluation benchmarks. Regarding medical expert validation, we did not conduct a formal review by domain experts due to practical limitations; we will explicitly state this as a limitation and describe the automated and heuristic-based quality assurance methods used instead. This addresses the potential for data artifacts. revision: partial
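The decontamination step the authors promise can be illustrated with an n-gram overlap filter, a common heuristic for removing benchmark leakage from training corpora. This is a hypothetical sketch, not the paper's actual procedure; the n-gram size and the toy documents are assumptions.

```python
def ngrams(text, n=8):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(corpus_docs, benchmark_texts, n=8):
    """Drop corpus documents sharing any n-gram with a benchmark item.

    n-gram matching (typically 8- to 13-grams in practice) is one
    plausible reading of the promised decontamination step.
    """
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [d for d in corpus_docs if not (ngrams(d, n) & bench)]

# Toy German corpus and benchmark snippet; n=3 keeps the demo small.
corpus = [
    "der patient klagt über starke kopfschmerzen seit drei tagen",
    "die dosierung von ibuprofen richtet sich nach dem körpergewicht",
]
bench = ["patient klagt über starke kopfschmerzen"]
kept = decontaminate(corpus, bench, n=3)
print(len(kept))  # → 1: the overlapping document is removed
```

Reporting the fraction of documents removed this way would make the "genuine knowledge vs. data artifact" question directly checkable.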
-
Referee: [Evaluation and Failure Mode Analysis] The pairwise win-rate analysis and failure mode discussion highlight trade-offs such as language mixing and increased verbosity after merging. However, it is not clear from the reported results how these issues quantitatively affect performance on medical instruction-following tasks or whether they undermine the claimed competitiveness of the specialized 7B models.
Authors: We thank the referee for pointing out the need for more quantitative analysis of the failure modes. The manuscript includes a qualitative discussion of language mixing and verbosity. For the revised version, we will incorporate quantitative assessments, such as the proportion of outputs flagged for language mixing using automated language identification tools, comparisons of response lengths (verbosity metrics), and their correlation with task performance scores on the medical benchmarks. This will help evaluate the impact on the overall competitiveness of the 7B models and inform the discussion on trade-offs. revision: yes
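The proposed language-mixing and verbosity metrics could look like the following sketch. A real pipeline would use a proper language identifier, as the authors suggest; the tiny function-word lists here are illustrative stand-ins, and the example sentence is invented.

```python
# Illustrative function-word lists standing in for a real language
# identifier; overlapping words (e.g. "in") are deliberately excluded.
EN = {"the", "and", "of", "is", "to", "that", "with", "for", "was", "on"}
DE = {"der", "die", "das", "und", "ist", "zu", "mit", "für", "nicht", "ein"}

def mixing_rate(text):
    """Fraction of recognized function words that are English, in
    output that should be entirely German (0.0 = no mixing)."""
    toks = [t.strip(".,;:!?").lower() for t in text.split()]
    en = sum(t in EN for t in toks)
    de = sum(t in DE for t in toks)
    return en / (en + de) if (en + de) else 0.0

def verbosity(text):
    """Response length in whitespace tokens."""
    return len(text.split())

mixed = "Der Patient ist stabil, and the dosage ist zu hoch."
print(round(mixing_rate(mixed), 2))  # → 0.33: a third of the cues are English
```

Correlating these two scores with per-prompt win/loss outcomes would show whether the observed trade-offs actually cost the specialized models wins.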
Circularity Check
No circularity: purely empirical pipeline with no derivations or self-referential reductions
full rationale
The paper constructs FineMed-de from FineWeb2, performs continual pre-training and merging on existing LLMs, then reports benchmark scores and win-rates. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes are invoked. All headline results (7B specialization gains, 3.5x win-rate) are direct empirical measurements, not reductions to inputs by construction. Self-citations, if present, are not load-bearing for any claimed derivation.