
arxiv: 2606.01904 · v2 · pith:UBG3A7N5new · submitted 2026-06-01 · 💻 cs.CL · cs.AI

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

Pith reviewed 2026-06-28 14:30 UTC · model grok-4.3

keywords clinical NLP · BERT models · domain-specific pre-training · Norwegian language · language model adaptation · healthcare texts · encoder models

The pith

Specialized pre-training on Norwegian clinical texts makes BERT models outperform general versions on clinical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates three BERT encoder models by continuing pre-training on a large corpus of de-identified Norwegian clinical documents that include discharge summaries, surgical reports, and nursing notes in both bokmål and nynorsk. It reports that each specialized model beats its baseline counterpart on three synthetic clinical benchmarks and two real-world problems. A sympathetic reader would care because clinical language contains domain-specific phrasing and structure that general models may miss, so targeted adaptation could improve automated processing of medical records. The work frames this as evidence that domain-specific pre-training delivers measurable gains for Norwegian clinical NLP.

Core claim

Continuing pre-training of existing BERT-based models on a representative corpus of real-world Norwegian clinical texts produces specialized models that consistently outperform the original baselines on synthetic Norwegian clinical benchmark datasets and real-world clinical problems.

What carries the argument

Continued pre-training of general Norwegian BERT models on a curated corpus of de-identified clinical documents to adapt them to clinical language patterns.
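The corruption step behind this kind of continued pre-training can be sketched as follows. This is a minimal illustration that assumes the standard masked-language-modelling recipe from the original BERT paper (select about 15% of tokens; of those, 80% become a mask symbol, 10% a random token, 10% stay unchanged). The source does not state which objective or masking rate the authors used, and the function name `mask_tokens`, the `[MASK]` symbol, and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; real tokenizers define their own


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels) for one masked-language-modelling example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a training loss would only be computed where a label exists.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction target.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # 80%: replace with mask symbol
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with random token
        else:
            corrupted.append(token)               # 10%: leave unchanged
    return corrupted, labels


# Made-up example sentence and toy vocabulary, for illustration only.
example = "pasienten ble utskrevet i god form".split()
toy_vocab = ["feber", "smerte", "kontroll"]
corrupted, labels = mask_tokens(example, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

In a real pipeline the same idea would run on subword IDs from the base model's tokenizer, and the encoder would be trained to recover the labelled positions from the corrupted input.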

If this is right

  • Domain-specific pre-training provides significant benefit for NLP tasks within the clinical domain.
  • Specialized models handle the linguistic features of Norwegian clinical texts more effectively than general models.
  • The gains apply across multiple document types including discharge summaries and nursing notes.
  • Both bokmål and nynorsk variants benefit from the same adaptation process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar continued pre-training could be tested on clinical texts from other languages with limited resources.
  • The approach might support more accurate information extraction from electronic health records in practice.
  • Future evaluations could include tasks like named entity recognition or relation extraction on clinical data.

Load-bearing premise

The three synthetic benchmarks and two real-world problems used for evaluation are representative of actual clinical NLP use cases.

What would settle it

A follow-up test on additional unseen Norwegian clinical texts or tasks showing no performance gain or a reversal would falsify the consistent outperformance claim.

Figures

Figures reproduced from arXiv: 2606.01904 by Christian Autenried, Cosimo Persia.

Figure 1. Precision and recall in percentage.

3.2 Evaluation Dataset. To facilitate robust evaluation and development of medical language models, we utilize and present five distinct benchmark datasets, each designed to address specific challenges within the domain. MedMCQA. The first, MedMCQA [13], is a large-scale Multiple-Choice Question Answering (MCQA) dataset in English constructed from real-world medical entra…
Abstract

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk including discharge summaries, surgical reports, nursing notes etc. ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthtetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms their baseline counterparts, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces KliniskVestBERT, a suite of three BERT encoder models created by continued pre-training of Nb-BERT-large, NorBERT3-large, and ModernBERT on a large corpus of de-identified Norwegian clinical texts from Helse Vest (discharge summaries, surgical reports, nursing notes, etc., in both bokmål and nynorsk). The central claim is that each domain-adapted model consistently outperforms its baseline on three synthetic Norwegian clinical benchmark datasets plus two real-world problems, demonstrating the benefit of domain-specific pre-training for clinical NLP.

Significance. If the results hold and the evaluation is representative, the work would supply concrete evidence that continued pre-training on real Norwegian clinical text improves downstream performance, filling a gap for a low-resource language in the clinical domain. The multi-institutional collaboration with Helse Vest entities lends practical credibility to the corpus construction.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'consistent outperformance' is asserted without any reported metrics, confidence intervals, statistical tests, or dataset sizes; this directly undermines assessment of whether the gains are load-bearing for the generalization to clinical utility.
  2. [Evaluation] Evaluation section (description of the three synthetic benchmarks): no details are supplied on how the synthetic datasets were constructed or validated against real Helse Vest notes; without this, it is impossible to determine whether they capture characteristic clinical phenomena (abbreviation density, negation scope, temporal reasoning, bokmål/nynorsk switching) that would be needed to support the claim that domain adaptation yields broad benefit.
  3. [Abstract] Abstract: the two 'real-world problems' are mentioned but not characterized (task definitions, data sources, sizes, or how they differ from the synthetic benchmarks), leaving the generalization argument dependent on uninspectable evidence.
minor comments (3)
  1. [Abstract] Typo: 'synthtetic' should be 'synthetic'.
  2. [Methods] The manuscript should include a table or section explicitly comparing the three models' pre-training hyperparameters, corpus statistics, and final checkpoint selection criteria.
  3. [Abstract / Data] Minor: the abstract states the dataset 'is based on a representative population' but supplies no supporting statistics on patient demographics or document-type distribution; a short table would improve clarity.
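Major comment 1 asks for confidence intervals on the reported gains. One way such an interval could be produced is a paired percentile bootstrap over per-example correctness flags, sketched below. This is an illustration under stated assumptions, not the authors' procedure: the function name `paired_bootstrap_ci`, the resample count, and the example flags are all hypothetical, and the manuscript does not say which metric or resampling scheme (if any) was used.

```python
import random


def paired_bootstrap_ci(baseline_correct, adapted_correct,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(adapted) - accuracy(baseline).

    Both inputs are equal-length lists of 0/1 flags, aligned so that index i
    refers to the same test example for both models.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(baseline_correct) != len(adapted_correct):
        raise ValueError("inputs must be aligned and of equal length")
    n = len(baseline_correct)
    if n == 0:
        raise ValueError("inputs must not be empty")
    rng = random.Random(seed)
    # Per-example difference: +1 adapted-only correct, -1 baseline-only correct.
    diffs = [a - b for a, b in zip(adapted_correct, baseline_correct)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return observed, resampled[lo_idx], resampled[hi_idx]


# Made-up correctness flags for eight test examples, for illustration only.
baseline = [1, 0, 0, 1, 0, 1, 0, 0]
adapted = [1, 1, 0, 1, 1, 1, 0, 1]
print(paired_bootstrap_ci(baseline, adapted, n_resamples=500))
```

Resampling pairs rather than each model's flags separately preserves the per-example correlation between the two models, which is what makes the comparison paired. An interval that excludes zero would support the outperformance claim for that task.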

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important gaps in the presentation of our evaluation results and dataset descriptions. We agree that these details are necessary for readers to assess the strength of our claims and will revise the manuscript to address each point.

  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim of 'consistent outperformance' is asserted without any reported metrics, confidence intervals, statistical tests, or dataset sizes; this directly undermines assessment of whether the gains are load-bearing for the generalization to clinical utility.

    Authors: We agree that the current abstract and Evaluation section do not include the specific performance metrics, confidence intervals, statistical tests, or dataset sizes needed to substantiate the 'consistent outperformance' claim. In the revised manuscript we will add a results table (or expanded text) reporting exact scores for each model on each task, along with dataset sizes, confidence intervals where applicable, and the results of statistical significance tests (e.g., McNemar or paired t-tests) comparing the domain-adapted models to their baselines. revision: yes

  2. Referee: [Evaluation] Evaluation section (description of the three synthetic benchmarks): no details are supplied on how the synthetic datasets were constructed or validated against real Helse Vest notes; without this, it is impossible to determine whether they capture characteristic clinical phenomena (abbreviation density, negation scope, temporal reasoning, bokmål/nynorsk switching) that would be needed to support the claim that domain adaptation yields broad benefit.

    Authors: The referee is correct that the manuscript currently provides no information on the construction or validation of the three synthetic benchmarks. We will add a dedicated subsection in the revised Evaluation section that describes (1) the generation process for each synthetic dataset, (2) any manual or automatic validation steps performed against real Helse Vest notes, and (3) how the datasets were designed to reflect clinical phenomena such as abbreviation density, negation scope, temporal reasoning, and language variation between bokmål and nynorsk. revision: yes

  3. Referee: [Abstract] Abstract: the two 'real-world problems' are mentioned but not characterized (task definitions, data sources, sizes, or how they differ from the synthetic benchmarks), leaving the generalization argument dependent on uninspectable evidence.

    Authors: We acknowledge that the abstract (and Evaluation section) only names the two real-world problems without providing task definitions, data sources, sizes, or their relationship to the synthetic benchmarks. In the revision we will expand both the abstract and the main text to include concise characterizations of these tasks, including their definitions, the origin and size of the evaluation data, and how they complement the synthetic benchmarks in demonstrating clinical utility. revision: yes
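The first response names McNemar's test as one candidate for comparing each domain-adapted model with its baseline. A minimal sketch of the exact (binomial) form of that test on paired correctness flags is given below. The function name `mcnemar_exact` and the example flags are hypothetical, and nothing here reflects results from the paper.

```python
from math import comb


def mcnemar_exact(baseline_correct, adapted_correct):
    """Two-sided exact McNemar test on paired 0/1 correctness flags.

    Returns (b, c, p_value), where b counts examples only the baseline got
    right and c counts examples only the adapted model got right. Examples
    where both models agree do not influence the test.
    """
    if len(baseline_correct) != len(adapted_correct):
        raise ValueError("inputs must be aligned and of equal length")
    pairs = list(zip(baseline_correct, adapted_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        # No discordant pairs: the models are indistinguishable on this data.
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up correctness flags for eight test examples, for illustration only.
baseline = [1, 0, 0, 1, 0, 1, 0, 0]
adapted = [1, 1, 0, 1, 1, 1, 0, 1]
print(mcnemar_exact(baseline, adapted))
```

The exact form is used here because it stays valid when the number of discordant pairs is small, which can happen with small clinical test sets. McNemar's test applies to per-example classification outcomes; for metrics averaged over folds or seeds, the paired t-test the authors also mention would be the closer fit.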

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison with no derivations or fitted quantities


The paper performs continued pre-training of existing BERT variants on a clinical corpus and reports empirical outperformance on held-out benchmarks. No equations, parameter fits, uniqueness theorems, or predictions are presented that could reduce to the inputs by construction. The evaluation is a standard train-then-test comparison against external baselines (Nb-BERT-large, NorBERT3-large, ModernBERT), with no self-citation load-bearing on any mathematical claim. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that continued pre-training on in-domain clinical text produces better downstream performance; no free parameters, invented entities, or additional axioms are introduced beyond standard transformer training.

axioms (1)
  • domain assumption: Domain-specific continued pre-training improves performance on clinical NLP tasks.
    Invoked when the abstract states that specialized models outperform baselines.



Reference graph

Works this paper leans on

  1. Guttorm Brattebø. Artificial Intelligence Support in Medical Emergency Calls – AISMEC project. 2025. url: https://www.helse-bergen.no/avdelinger/kirurgisk-serviceklinikk/fag-og-forskingsavdelinga/aismec/ (visited on 07/01/2025)

  2. Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805. url: http://arxiv.org/abs/1810.04805

  3. Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. 2020. arXiv: 1904.05342 [cs.CL]. url: https://arxiv.org/abs/1904.05342

  4. Mandar Joshi et al. “SpanBERT: Improving Pre-training by Representing and Predicting Spans”. In: Transactions of the Association for Computational Linguistics 8 (Jan. 2020), pp. 64–77. issn: 2307-387X. doi: 10.1162/tacl_a_00300. eprint: https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00300/1923170/tacl_a_00300.pdf. url: https://doi.org/10.1162/ta...

  5. Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2017. arXiv: 1412.6980 [cs.LG]. url: https://arxiv.org/abs/1412.6980

  6. Per Kummervold, Freddy Wetjen, and Javier de la Rosa. “The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models”. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. Ed. by Nicoletta Calzolari et al. Marseille, France: European Language Resources Association, June 2022, pp. 3852–3860. url: https:...

  7. Yinhan Liu et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019. arXiv: 1907.11692 [cs.CL]. url: https://arxiv.org/abs/1907.11692

  8. Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. 2019. arXiv: 1711.05101 [cs.LG]. url: https://arxiv.org/abs/1711.05101

  9. Jørgen Aarmo Lund et al. “Instruction-guided deidentification with synthetic test cases for Norwegian clinical text”. In: Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL). Ed. by Tetiana Lutchyn, Adín Ramírez Rivera, and Benjamin Ricaud. Vol. 233. Proceedings of Machine Learning Research. PMLR, Sept. 2024, pp. 145–152. url: https:...

  10. Clara H. McCreery et al. Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs. 2020. arXiv: 2008.13546 [cs.IR]

  11. P. D. Ngo et al. “Domain-Specific Pretraining and Evaluation of NorDeClin-BERT for ICD-10 Code Prediction in Norwegian Clinical Texts”. In: JMIR AI 66153 (2025). Forthcoming. doi: 10.2196/66153. url: https://preprints.jmir.org/preprint/66153

  12. Casey Lynnette Overby, Peter Tarczy-Hornoch, and Dina Demner-Fushman. “The potential for automated question answering in the context of genomic medicine: an assessment of existing resources and properties of answers”. In: BMC Bioinformatics 10.Suppl 9 (Sept. 2009), S8

  13. Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. “MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering”. In: Proceedings of the Conference on Health, Inference, and Learning. Ed. by Gerardo Flores et al. Vol. 174. Proceedings of Machine Learning Research. PMLR, Apr. 2022, pp. 248–260. url: https:/...

  14. Adam Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv: 1912.01703 [cs.LG]. url: https://arxiv.org/abs/1912.01703

  15. Amit Persad, Eleni Stroulia, and Sarah Forgie. “A novel approach to virtual patient simulation using natural language processing”. In: Medical Education 50.11 (2016), pp. 1162–1163. doi: https://doi.org/10.1111/medu.13197. eprint: https://asmepublications.onlinelibrary.wiley.com/doi/pdf/10.1111/medu.13197. url: https://asmepublications.onlinelibrary.wiley.c...

  16. Alec Radford et al. Robust Speech Recognition via Large-Scale Weak Supervision. 2022. arXiv: 2212.04356 [eess.AS]. url: https://arxiv.org/abs/2212.04356

  17. David Samuel et al. NorBench – A Benchmark for Norwegian Language Models. 2023. arXiv: 2305.03880 [cs.CL]. url: https://arxiv.org/abs/2305.03880

  18. Imre Solti et al. “Automated classification of radiology reports for acute lung injury: Comparison of keyword and machine learning based natural language processing approaches”. In: 2009 IEEE International Conference on Bioinformatics and Biomedicine Workshop. Washington, DC: IEEE, Nov. 2009

  19. Benjamin Warner et al. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. 2024. arXiv: 2412.13663 [cs.CL]. url: https://arxiv.org/abs/2412.13663

  20. Jannicke Slettli Wathne. KI for uønskede legemiddelhendelser. 2025. url: https://www.helse-bergen.no (visited on 07/01/2025)

  21. Thomas Wolf et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv: 1910.03771 [cs.CL]. url: https://arxiv.org/abs/1910.03771

  22. Mitchell Wortsman et al. Stable and low-precision training for large-scale vision-language models. arXiv: 2304.13013 [cs.LG]. url: https://arxiv.org/abs/2304.13013