KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts
Pith reviewed 2026-06-28 14:30 UTC · model grok-4.3
The pith
Specialized pre-training on Norwegian clinical texts makes BERT models outperform general versions on clinical tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continuing pre-training of existing BERT-based models on a representative corpus of real-world Norwegian clinical texts produces specialized models that consistently outperform the original baselines on synthetic Norwegian clinical benchmark datasets and real-world clinical problems.
What carries the argument
Continued pre-training of general Norwegian BERT models on a curated corpus of de-identified clinical documents to adapt them to clinical language patterns.
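As a rough illustration of what continued pre-training involves at the token level, the sketch below applies the standard masked-language-modelling corruption described in the original BERT paper (select about 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). This is an assumption for illustration: the abstract does not state which masking scheme or rates the authors used. The function name `mask_tokens`, the toy vocabulary, and the example sentence are all hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for masked-language-model training.

    Returns (corrupted, labels). labels[i] is the original token at
    positions the model must predict, and None everywhere else.
    The 15% / 80-10-10 rates are the BERT defaults, assumed here;
    the paper under review does not report its own values.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, nothing to predict.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: random vocabulary token
        else:
            corrupted.append(token)               # 10%: leave unchanged
    return corrupted, labels


# Hypothetical example sentence and vocabulary, invented for illustration.
example = ["pasienten", "ble", "innlagt", "med", "brystsmerter"]
toy_vocab = ["smerter", "feber", "utskrevet", "kontroll"]
corrupted, labels = mask_tokens(example, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

During continued pre-training, the model starts from the general checkpoint's weights and is optimised to recover the `labels` from the `corrupted` input, so the only change relative to the original pre-training is the text it sees.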
If this is right
- Domain-specific pre-training provides significant benefit for NLP tasks within the clinical domain.
- Specialized models handle the linguistic features of Norwegian clinical texts more effectively than general models.
- The gains apply across multiple document types including discharge summaries and nursing notes.
- Both bokmål and nynorsk variants benefit from the same adaptation process.
Where Pith is reading between the lines
- Similar continued pre-training could be tested on clinical texts from other languages with limited resources.
- The approach might support more accurate information extraction from electronic health records in practice.
- Future evaluations could include tasks like named entity recognition or relation extraction on clinical data.
Load-bearing premise
The three synthetic benchmarks and two real-world problems used for evaluation are representative of actual clinical NLP use cases.
What would settle it
A follow-up test on additional unseen Norwegian clinical texts or tasks showing no performance gain or a reversal would falsify the consistent outperformance claim.
Original abstract
The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk including discharge summaries, surgical reports, nursing notes etc. ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthtetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms their baseline counterparts, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces KliniskVestBERT, a suite of three BERT encoder models created by continued pre-training of Nb-BERT-large, NorBERT3-large, and ModernBERT on a large corpus of de-identified Norwegian clinical texts from Helse Vest (discharge summaries, surgical reports, nursing notes, etc., in both bokmål and nynorsk). The central claim is that each domain-adapted model consistently outperforms its baseline on three synthetic Norwegian clinical benchmark datasets plus two real-world problems, demonstrating the benefit of domain-specific pre-training for clinical NLP.
Significance. If the results hold and the evaluation is representative, the work would supply concrete evidence that continued pre-training on real Norwegian clinical text improves downstream performance, filling a gap for a low-resource language in the clinical domain. The multi-institutional collaboration with Helse Vest entities lends practical credibility to the corpus construction.
major comments (3)
- [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'consistent outperformance' is asserted without any reported metrics, confidence intervals, statistical tests, or dataset sizes; this directly undermines assessment of whether the gains are load-bearing for the generalization to clinical utility.
- [Evaluation] Evaluation section (description of the three synthetic benchmarks): no details are supplied on how the synthetic datasets were constructed or validated against real Helse Vest notes; without this, it is impossible to determine whether they capture characteristic clinical phenomena (abbreviation density, negation scope, temporal reasoning, bokmål/nynorsk switching) that would be needed to support the claim that domain adaptation yields broad benefit.
- [Abstract] Abstract: the two 'real-world problems' are mentioned but not characterized (task definitions, data sources, sizes, or how they differ from the synthetic benchmarks), leaving the generalization argument dependent on uninspectable evidence.
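The first major comment asks for confidence intervals on the reported gains. One common way to obtain them, sketched here under stated assumptions, is a paired percentile bootstrap over test examples: resample the same example indices for both models and look at the spread of the accuracy difference. The function name `paired_bootstrap_ci` and the 0/1 correctness inputs are hypothetical; the paper does not say which interval method, if any, the authors intend to use.

```python
import random


def paired_bootstrap_ci(baseline_correct, adapted_correct,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(adapted) - accuracy(baseline).

    Both inputs are equal-length lists of 0/1 flags, one per test example,
    aligned so that index i refers to the same example in both lists.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(baseline_correct) != len(adapted_correct):
        raise ValueError("inputs must be aligned and of equal length")
    n = len(baseline_correct)
    if n == 0:
        raise ValueError("need at least one test example")

    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, shared by both models.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(adapted_correct[i] - baseline_correct[i] for i in idx) / n
        diffs.append(diff)
    diffs.sort()

    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    observed = (sum(adapted_correct) - sum(baseline_correct)) / n
    return observed, lower, upper
```

An interval that excludes zero on every task would support the "consistent outperformance" wording; an interval that straddles zero on small real-world test sets would suggest the claim needs softening.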
minor comments (3)
- [Abstract] Typo: 'synthtetic' should be 'synthetic'.
- [Methods] The manuscript should include a table or section explicitly comparing the three models' pre-training hyperparameters, corpus statistics, and final checkpoint selection criteria.
- [Abstract / Data] Minor: the abstract states the dataset 'is based on a representative population' but supplies no supporting statistics on patient demographics or document-type distribution; a short table would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important gaps in the presentation of our evaluation results and dataset descriptions. We agree that these details are necessary for readers to assess the strength of our claims and will revise the manuscript to address each point.
- Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'consistent outperformance' is asserted without any reported metrics, confidence intervals, statistical tests, or dataset sizes; this directly undermines assessment of whether the gains are load-bearing for the generalization to clinical utility.
  Authors: We agree that the current abstract and Evaluation section do not include the specific performance metrics, confidence intervals, statistical tests, or dataset sizes needed to substantiate the 'consistent outperformance' claim. In the revised manuscript we will add a results table (or expanded text) reporting exact scores for each model on each task, along with dataset sizes, confidence intervals where applicable, and the results of statistical significance tests (e.g., McNemar or paired t-tests) comparing the domain-adapted models to their baselines.
  Revision: yes
- Referee: [Evaluation] Evaluation section (description of the three synthetic benchmarks): no details are supplied on how the synthetic datasets were constructed or validated against real Helse Vest notes; without this, it is impossible to determine whether they capture characteristic clinical phenomena (abbreviation density, negation scope, temporal reasoning, bokmål/nynorsk switching) that would be needed to support the claim that domain adaptation yields broad benefit.
  Authors: The referee is correct that the manuscript currently provides no information on the construction or validation of the three synthetic benchmarks. We will add a dedicated subsection in the revised Evaluation section that describes (1) the generation process for each synthetic dataset, (2) any manual or automatic validation steps performed against real Helse Vest notes, and (3) how the datasets were designed to reflect clinical phenomena such as abbreviation density, negation scope, temporal reasoning, and language variation between bokmål and nynorsk.
  Revision: yes
- Referee: [Abstract] Abstract: the two 'real-world problems' are mentioned but not characterized (task definitions, data sources, sizes, or how they differ from the synthetic benchmarks), leaving the generalization argument dependent on uninspectable evidence.
  Authors: We acknowledge that the abstract (and Evaluation section) only names the two real-world problems without providing task definitions, data sources, sizes, or their relationship to the synthetic benchmarks. In the revision we will expand both the abstract and the main text to include concise characterizations of these tasks, including their definitions, the origin and size of the evaluation data, and how they complement the synthetic benchmarks in demonstrating clinical utility.
  Revision: yes
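The rebuttal names McNemar's test as one candidate for comparing each domain-adapted model with its baseline. A minimal sketch of the exact (binomial) form of that test follows, assuming per-example 0/1 correctness flags for both models on the same test set. The function name `mcnemar_exact` is hypothetical, and nothing here reflects the authors' actual evaluation code.

```python
from math import comb


def mcnemar_exact(baseline_correct, adapted_correct):
    """Exact two-sided McNemar test on paired 0/1 correctness flags.

    Only discordant pairs matter:
      b = examples the baseline got right and the adapted model got wrong
      c = examples the adapted model got right and the baseline got wrong
    Under the null hypothesis, each discordant pair is equally likely to
    fall on either side, so min(b, c) follows Binomial(b + c, 0.5).
    Returns (b, c, p_value).
    """
    if len(baseline_correct) != len(adapted_correct):
        raise ValueError("inputs must be aligned and of equal length")

    b = sum(1 for x, y in zip(baseline_correct, adapted_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, adapted_correct) if y and not x)
    n = b + c
    if n == 0:
        # The models agree on every example; no evidence of a difference.
        return b, c, 1.0

    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)  # two-sided, capped at 1
    return b, c, p_value
```

McNemar's test suits classification tasks scored per example. For span-level or aggregate metrics such as F1, a resampling-based test is usually the more natural choice.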
Circularity Check
No circularity: purely empirical model comparison with no derivations or fitted quantities
The paper performs continued pre-training of existing BERT variants on a clinical corpus and reports empirical outperformance on held-out benchmarks. No equations, parameter fits, uniqueness theorems, or predictions are presented that could reduce to the inputs by construction. The evaluation is a standard train-then-test comparison against external baselines (Nb-BERT-large, NorBERT3-large, ModernBERT), with no self-citation load-bearing on any mathematical claim. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: domain-specific continued pre-training improves performance on clinical NLP tasks.