
arxiv: 2606.28538 · v1 · pith:Y4Q44JX3new · submitted 2026-06-26 · 💻 cs.CL

Legal Domain Adaptation of Modern BERT Models

Pith reviewed 2026-06-30 01:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords: domain adaptation, ModernBERT, legal domain, masked language modeling, US court opinions, pre-training, long context, embeddings

The pith

Further pre-training ModernBERT on US court opinions yields significant gains on legal tasks even after its initial large-scale training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether ModernBERT, already trained on vastly more data than the original BERT, still gains from additional pre-training on legal text. It continues pre-training the model on all US court opinions using the masked language modeling objective and evaluates the result on datasets tied to those opinions. The adapted model shows clear improvements over the unadapted version, matching the scale of gains reported in early BERT domain-adaptation studies. From-scratch training on the same data falls short of adapting an existing checkpoint. The resulting models handle sequences up to 8,192 tokens and support legal passage embeddings or fast reranking.

Core claim

Although ModernBERT has been trained on roughly 500x more data than original BERT, further pre-training on all US court opinions using the masked language modeling objective produces significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions, with gains similar to those reported in early work on domain adaptation of BERT-like models; from-scratch pre-training does not match the performance of further pre-training an existing checkpoint.

What carries the argument

Continued pre-training via the masked language modeling objective on the full set of US court opinions, which adapts representations for legal content while preserving the model's native capacity for sequences up to 8,192 tokens.
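
As a rough illustration of the masked language modeling objective named above, the sketch below corrupts a token sequence and records which positions the model would have to predict. It is not the authors' training code: the 80/10/10 replacement scheme follows the original BERT recipe, and `mask_prob`, `MASK_TOKEN`, and the toy vocabulary are illustrative assumptions, since this page does not report the masking rate or tokenizer used.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; the real tokenizer's mask token may differ


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None elsewhere.
    mask_prob=0.15 is the original BERT value, used here as an assumption.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with mask
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the court held that the statute was unconstitutional".split()
    noisy, targets = mask_tokens(sentence, vocab=sentence, mask_prob=0.5, seed=1)
    print(noisy)
    print(targets)
```

During continued pre-training, the loss would be computed only at positions where the label is not `None`; starting from an existing checkpoint rather than random weights is what distinguishes the adapted model from the from-scratch baseline.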

If this is right

  • Significant improvements over vanilla ModernBERT on every tested dataset tied to US court opinions.
  • Gains of the same magnitude as those found in early domain-adaptation experiments with BERT-like models.
  • From-scratch pre-training on the same legal corpus underperforms adaptation of an existing ModernBERT checkpoint.
  • The adapted models support computation of embeddings for legal passages and rapid reranking of hundreds of passages per query.
  • All resulting checkpoints are released for public use.
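
The embedding and reranking use case in the list above can be sketched as follows. This is a minimal sketch under stated assumptions: the page does not say how passage embeddings are pooled or scored, so mean pooling over token vectors and cosine similarity are assumed here (a common bi-encoder setup), and the small hand-written vectors stand in for real model outputs. The function names `mean_pool`, `cosine`, and `rerank` are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one passage embedding (assumed pooling)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query_vec, passage_vecs, top_k=None):
    """Return (passage_index, score) pairs sorted by descending similarity."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return [(idx, score) for score, idx in scored][:top_k]


if __name__ == "__main__":
    query = mean_pool([[1.0, 0.0], [1.0, 0.0]])
    passages = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    for idx, score in rerank(query, passages):
        print(idx, round(score, 3))
```

Scoring a few hundred passages this way is a single pass over precomputed vectors, which is one reason an encoder of this size could plausibly serve as a fast reranker. A cross-encoder that reads query and passage together would be a different design and is not shown here.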

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same continued-pre-training recipe could be tested on other narrow domains such as medicine or finance.
  • The 8k-token context length opens the possibility of processing entire long legal documents in a single pass rather than chunking.
  • Efficiency comparisons between adapting existing large models versus training new ones from scratch become relevant for resource planning.
  • Downstream legal applications may now incorporate these embeddings for retrieval or classification without additional fine-tuning data.
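
To make the single-pass versus chunking point concrete, the sketch below counts how many windows a document needs under a given context limit. The 6,000-token document length and the 64-token overlap are hypothetical values chosen for illustration; 512 is the original BERT limit and 8,192 is the limit stated for the adapted models. `window_starts` is a hypothetical helper, not part of any library.

```python
def window_starts(n_tokens, max_len, overlap=0):
    """Start offsets of the windows needed to cover n_tokens tokens.

    Each window holds up to max_len tokens; consecutive windows share
    `overlap` tokens so that text near a boundary is seen with context.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if n_tokens <= 0:
        return []
    stride = max_len - overlap
    starts = [0]
    # Keep adding windows until the last one reaches the end of the document.
    while starts[-1] + max_len < n_tokens:
        starts.append(starts[-1] + stride)
    return starts


if __name__ == "__main__":
    doc_len = 6000  # hypothetical opinion length in tokens
    print(len(window_starts(doc_len, 512)))              # windows at a 512-token limit
    print(len(window_starts(doc_len, 512, overlap=64)))  # same, with overlapping windows
    print(len(window_starts(doc_len, 8192)))             # windows at an 8,192-token limit
```

With a single window there is no need to aggregate per-chunk predictions or embeddings, which is the practical benefit the bullet above points to.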

Load-bearing premise

That masked language modeling on court opinions will produce better representations for downstream legal tasks.

What would settle it

No measurable improvement on the US court opinion datasets, or equivalent results from a from-scratch model, would falsify the benefit of further pre-training.

Figures

Figures reproduced from arXiv: 2606.28538 by Dominik Stammbach, Peter Henderson.

Figure 1. Loss curves during pre-training: number of steps on the x-axis, loss on the y-axis. Top row displays curves initialized …
Original abstract

We investigate domain adaptation of modern BERT models in the legal domain. We further pre-train ModernBERT on all US court opinions using the masked language modeling objective. Although ModernBERT has been trained on roughly 500x more data than original BERT, we still find that this model benefits from further pre-training and domain adaptation in the legal domain: we report significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions. We find gains similar to those reported in early work on domain adaptation of BERT-like models. However, from scratch pre-training does not match the performance of further pre-training an existing ModernBERT checkpoint in our experiments. The resulting models are capable of processing sequences up to 8,192 tokens, and can be used to compute meaningful embeddings of legal passages, or could quickly rerank hundreds of legal passages for a given search query. We release all model checkpoints publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that further pre-training ModernBERT on US court opinions via masked language modeling produces significant gains over the vanilla checkpoint on legal downstream tasks, with gains comparable to early BERT domain-adaptation results. It also reports that from-scratch pre-training underperforms continued pre-training, and that the resulting models support sequences of up to 8,192 tokens and are released publicly for embedding and reranking use cases.

Significance. If the empirical claims hold after details are supplied, the work would show that domain adaptation remains useful even for a model pre-trained on roughly 500x more data than the original BERT, with direct utility for legal NLP. The public release of checkpoints is a clear strength for reproducibility.

major comments (3)
  1. [Abstract] The central claim of 'significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions' is unsupported by any reported corpus size, token count, training steps, learning-rate schedule, batch size, statistical tests, or variance across runs, making it impossible to verify whether domain adaptation (rather than extra compute) is responsible.
  2. [Abstract] The statement that 'from scratch pre-training does not match the performance of further pre-training' lacks confirmation that equivalent total training budget, optimizer settings, or data volume were used in the from-scratch run, which is load-bearing for the claim that continued pre-training of an existing checkpoint is preferable.
  3. [Abstract] No ablation or control experiment is described that holds total compute fixed while varying only domain specificity, leaving open the possibility that observed gains are explained by optimization differences or additional steps rather than the legal corpus.
minor comments (2)
  1. The phrase 'all datasets connected to US court opinions' is vague; explicit dataset names, sizes, and task formulations would improve clarity.
  2. Consider adding a results table with exact metrics, baselines, and significance markers to replace the qualitative 'significant improvements' phrasing.
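
One way the requested significance markers could be produced is a paired bootstrap over per-example correctness, sketched below. This is an illustration of a standard technique, not a procedure described in the paper: the function name `paired_bootstrap`, the resample count, and the toy correctness vectors are all assumptions.

```python
import random


def paired_bootstrap(base_correct, adapted_correct, n_resamples=2000, seed=0):
    """Paired bootstrap comparison of two models on the same test examples.

    Inputs are equal-length lists of 0/1 correctness flags, aligned by example.
    Returns (observed_gain, fraction_not_better): the accuracy gain of the
    adapted model, and the fraction of resamples in which it fails to beat
    the baseline.
    """
    if len(base_correct) != len(adapted_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(base_correct)
    diffs = [a - b for a, b in zip(adapted_correct, base_correct)]
    observed_gain = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping the pairing intact.
        sample_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if sample_total <= 0:
            not_better += 1
    return observed_gain, not_better / n_resamples


if __name__ == "__main__":
    # Toy data: the baseline gets half the examples right, the adapted model all.
    base = [0] * 50 + [1] * 50
    adapted = [1] * 100
    print(paired_bootstrap(base, adapted, n_resamples=200))
```

Reporting this fraction alongside mean and standard deviation across training seeds would address the referee's request for variance and statistical tests, though it does not by itself separate the effect of the legal corpus from the effect of extra training steps.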

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on the need for greater experimental transparency. We will revise the manuscript to expand the abstract and add a methods section with the requested details on training configuration, corpus statistics, and controls.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions' is unsupported by any reported corpus size, token count, training steps, learning-rate schedule, batch size, statistical tests, or variance across runs, making it impossible to verify whether domain adaptation (rather than extra compute) is responsible.

    Authors: We agree that the abstract does not currently report these details. In the revision we will add the US court opinions corpus size in tokens, total training steps, learning-rate schedule, batch size, results with standard deviation across runs, and statistical significance tests. These additions will allow readers to assess whether the gains stem from domain adaptation rather than extra compute. revision: yes

  2. Referee: [Abstract] The statement that 'from scratch pre-training does not match the performance of further pre-training' lacks confirmation that equivalent total training budget, optimizer settings, or data volume were used in the from-scratch run, which is load-bearing for the claim that continued pre-training of an existing checkpoint is preferable.

    Authors: The referee is correct that the abstract omits explicit confirmation of matched budgets. We will revise the text to state that the from-scratch run used an equivalent total training budget, identical optimizer settings, and the same data volume as the continued pre-training run; full hyperparameter tables will be added to the methods section. revision: yes

  3. Referee: [Abstract] No ablation or control experiment is described that holds total compute fixed while varying only domain specificity, leaving open the possibility that observed gains are explained by optimization differences or additional steps rather than the legal corpus.

    Authors: We acknowledge the value of an explicit ablation that isolates domain specificity while holding total compute constant. Our existing from-scratch versus continued-pretraining comparison already matches total compute and data volume; we will expand the discussion to clarify this control and explicitly note the remaining limitations. If space allows we will add a short additional control experiment. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical domain-adaptation results rest on measured performance, not derivations or self-referential fits

full rationale

The paper is an empirical report of continued pre-training of ModernBERT on US court opinions via masked language modeling, followed by evaluation on legal datasets. No equations, uniqueness theorems, or first-principles derivations are present; performance gains are stated as direct experimental measurements. The central claim (further pre-training yields improvements) is not reduced to fitted parameters by construction, nor does it rely on self-citation chains for load-bearing premises. This matches the default case of a self-contained empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work implicitly relies on the standard assumption that masked language modeling pre-training improves downstream task performance.

axioms (1)
  • Domain assumption: masked language modeling pre-training on domain-specific text improves model representations for downstream tasks in that domain.
    Invoked by the decision to further pre-train with MLM and the expectation of downstream gains; standard in BERT literature but not proven in the abstract.

