Legal Domain Adaptation of Modern BERT Models

Dominik Stammbach; Peter Henderson

arxiv: 2606.28538 · v1 · pith:Y4Q44JX3new · submitted 2026-06-26 · 💻 cs.CL

Legal Domain Adaptation of Modern BERT Models

Dominik Stammbach , Peter Henderson This is my paper

Pith reviewed 2026-06-30 01:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords domain adaptationModernBERTlegal domainmasked language modelingUS court opinionspre-traininglong contextembeddings

0 comments

The pith

Further pre-training ModernBERT on US court opinions yields significant gains on legal tasks even after its initial large-scale training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether ModernBERT, already trained on vastly more data than the original BERT, still gains from additional pre-training on legal text. It continues pre-training the model on all US court opinions using the masked language modeling objective and evaluates the result on datasets tied to those opinions. The adapted model shows clear improvements over the unadapted version, matching the scale of gains reported in early BERT domain-adaptation studies. From-scratch training on the same data falls short of adapting an existing checkpoint. The resulting models handle sequences up to 8,192 tokens and support legal passage embeddings or fast reranking.

Core claim

Although ModernBERT has been trained on roughly 500x more data than original BERT, further pre-training on all US court opinions using the masked language modeling objective produces significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions, with gains similar to those reported in early work on domain adaptation of BERT-like models; from-scratch pre-training does not match the performance of further pre-training an existing checkpoint.

What carries the argument

Continued pre-training via the masked language modeling objective on the full set of US court opinions, which adapts representations for legal content while preserving the model's native capacity for sequences up to 8,192 tokens.

If this is right

Significant improvements over vanilla ModernBERT on every tested dataset tied to US court opinions.
Gains of the same magnitude as those found in early domain-adaptation experiments with BERT-like models.
From-scratch pre-training on the same legal corpus underperforms adaptation of an existing ModernBERT checkpoint.
The adapted models support computation of embeddings for legal passages and rapid reranking of hundreds of passages per query.
All resulting checkpoints are released for public use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same continued-pre-training recipe could be tested on other narrow domains such as medicine or finance.
The 8k-token context length opens the possibility of processing entire long legal documents in a single pass rather than chunking.
Efficiency comparisons between adapting existing large models versus training new ones from scratch become relevant for resource planning.
Downstream legal applications may now incorporate these embeddings for retrieval or classification without additional fine-tuning data.

Load-bearing premise

That masked language modeling on court opinions will produce better representations for downstream legal tasks.

What would settle it

No measurable improvement on the US court opinion datasets, or equivalent results from a from-scratch model, would falsify the benefit of further pre-training.

Figures

Figures reproduced from arXiv: 2606.28538 by Dominik Stammbach, Peter Henderson.

**Figure 1.** Figure 1: Loss curves during pre-training: number of steps on the x-axis, loss on the y-axis. Top row displays curves initialized [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

read the original abstract

We investigate domain adaptation of modern BERT models in the legal domain. We further pre-train ModernBERT on all US court opinions using the masked language modeling objective. Although ModernBERT has been trained on roughly 500x more data than original BERT, we still find that this model benefits from further pre-training and domain adaptation in the legal domain: we report significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions. We find gains similar to those reported in early work on domain adaptation of BERT-like models. However, from scratch pre-training does not match the performance of further pre-training an existing ModernBERT checkpoint in our experiments. The resulting models are capable of processing sequences up to 8,192 tokens, and can be used to compute meaningful embeddings of legal passages, or could quickly rerank hundreds of legal passages for a given search query. We release all model checkpoints publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ModernBERT still gains from legal-domain continued pre-training, but missing scale, step counts, and controls make it hard to credit the domain rather than extra compute.

read the letter

The paper runs masked LM continued pre-training on US court opinions starting from a ModernBERT checkpoint and reports gains on legal tasks that line up with the size of improvements seen in the original BERT domain-adaptation papers. Adapting the existing model beats training from scratch on the same data, and the adapted checkpoints are released with 8k-token support.

Releasing the models is the clearest practical contribution. Anyone who needs embeddings or reranking over long legal passages can download and use them without repeating the run.

The soft spot is the lack of experimental controls. The abstract gives no corpus size for the court opinions, no token or step count for the adaptation phase, no learning-rate or batch details, and no ablation that holds total training budget fixed while changing only the data domain. Without those numbers it is difficult to separate domain adaptation from simply running more optimization steps. The from-scratch comparison is stated but not shown to have used equivalent resources, so that result is also hard to read.

The work is incremental rather than a new method. It confirms that the old trick still works on a newer base model, which is useful to know but not surprising.

This is mainly for practitioners who want off-the-shelf legal models or for groups that need long-context legal embeddings. Researchers focused on adaptation techniques will find it confirmatory at best. It deserves peer review if the full version supplies the missing training details and proper ablations; otherwise the central claim stays under-supported.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that further pre-training ModernBERT on US court opinions via masked language modeling produces significant gains over the vanilla checkpoint on legal downstream tasks, with gains comparable to early BERT domain-adaptation results; from-scratch pre-training underperforms continued pre-training; the resulting models support 8192-token sequences and are released publicly for embedding and reranking use cases.

Significance. If the empirical claims hold after details are supplied, the work would show that domain adaptation can still be useful even after 500x more pre-training data than original BERT, with direct utility for legal NLP. The public release of checkpoints is a clear strength for reproducibility.

major comments (3)

[Abstract] Abstract: the central claim of 'significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions' is unsupported by any reported corpus size, token count, training steps, learning-rate schedule, batch size, statistical tests, or variance across runs, making it impossible to verify whether domain adaptation (rather than extra compute) is responsible.
[Abstract] Abstract: the statement that 'from scratch pre-training does not match the performance of further pre-training' lacks confirmation that equivalent total training budget, optimizer settings, or data volume were used in the from-scratch run, which is load-bearing for the claim that continued pre-training of an existing checkpoint is preferable.
[Abstract] Abstract: no ablation or control experiment is described that holds total compute fixed while varying only domain specificity, leaving open the possibility that observed gains are explained by optimization differences or additional steps rather than the legal corpus.

minor comments (2)

The phrase 'all datasets connected to US court opinions' is vague; explicit dataset names, sizes, and task formulations would improve clarity.
Consider adding a results table with exact metrics, baselines, and significance markers to replace the qualitative 'significant improvements' phrasing.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed feedback on the need for greater experimental transparency. We will revise the manuscript to expand the abstract and add a methods section with the requested details on training configuration, corpus statistics, and controls.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'significant improvements compared to vanilla ModernBERT on all datasets connected to US court opinions' is unsupported by any reported corpus size, token count, training steps, learning-rate schedule, batch size, statistical tests, or variance across runs, making it impossible to verify whether domain adaptation (rather than extra compute) is responsible.

Authors: We agree that the abstract does not currently report these details. In the revision we will add the US court opinions corpus size in tokens, total training steps, learning-rate schedule, batch size, results with standard deviation across runs, and statistical significance tests. These additions will allow readers to assess whether the gains stem from domain adaptation rather than extra compute. revision: yes
Referee: [Abstract] Abstract: the statement that 'from scratch pre-training does not match the performance of further pre-training' lacks confirmation that equivalent total training budget, optimizer settings, or data volume were used in the from-scratch run, which is load-bearing for the claim that continued pre-training of an existing checkpoint is preferable.

Authors: The referee is correct that the abstract omits explicit confirmation of matched budgets. We will revise the text to state that the from-scratch run used an equivalent total training budget, identical optimizer settings, and the same data volume as the continued pre-training run; full hyperparameter tables will be added to the methods section. revision: yes
Referee: [Abstract] Abstract: no ablation or control experiment is described that holds total compute fixed while varying only domain specificity, leaving open the possibility that observed gains are explained by optimization differences or additional steps rather than the legal corpus.

Authors: We acknowledge the value of an explicit ablation that isolates domain specificity while holding total compute constant. Our existing from-scratch versus continued-pretraining comparison already matches total compute and data volume; we will expand the discussion to clarify this control and explicitly note the remaining limitations. If space allows we will add a short additional control experiment. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical domain-adaptation results rest on measured performance, not derivations or self-referential fits

full rationale

The paper is an empirical report of continued pre-training of ModernBERT on US court opinions via masked language modeling, followed by evaluation on legal datasets. No equations, uniqueness theorems, or first-principles derivations are present; performance gains are stated as direct experimental measurements. The central claim (further pre-training yields improvements) is not reduced to fitted parameters by construction, nor does it rely on self-citation chains for load-bearing premises. This matches the default case of a self-contained empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work implicitly relies on the standard assumption that masked language modeling pre-training improves downstream task performance.

axioms (1)

domain assumption Masked language modeling pre-training on domain-specific text improves model representations for downstream tasks in that domain
Invoked by the decision to further pre-train with MLM and the expectation of downstream gains; standard in BERT literature but not proven in the abstract.

pith-pipeline@v0.9.1-grok · 5670 in / 1234 out tokens · 27061 ms · 2026-06-30T01:21:07.802182+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 10 canonical work pages

[1]

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models. InProceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 298–306. https://dl.acm.org/doi/abs/10.1145/3461702. 3462624

work page doi:10.1145/3461702 2021
[2]

Elliott Ash, Aniket Kesari, Suresh Naidu, Lena Song, and Dominik Stammbach
[3]

InProceedings of the Symposium on Computer Science and Law(Boston, MA, USA)(CSLA W ’24)

Translating Legalese: Enhancing Public Understanding of Court Opinions with Legal Summarizers. InProceedings of the Symposium on Computer Science and Law(Boston, MA, USA)(CSLA W ’24). Association for Computing Machinery, New York, NY, USA, 136–157. doi:10.1145/3614407.3643700

work page doi:10.1145/3614407.3643700
[4]

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. InProceedings of the 2019 Conference on Empirical Meth- ods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association ...

work page doi:10.18653/v1/d19-1371 2019
[5]

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. InFindings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 2898–2904. doi:10.18653/v1/2...

work page doi:10.18653/v1/2020.findings-emnlp.261 2020
[6]

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada,...

2023
[7]

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androut- sopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline...

work page doi:10.18653/v1/2022.acl-long.297 2022
[8]

Inyoung Cheong, Patty Liu, Dominik Stammbach, and Peter Henderson. 2026. How Can AI Augment Access to Justice? Public Defenders’ Perspectives on AI Adoption. arXiv:2510.22933 [cs.CY] https://arxiv.org/abs/2510.22933

Pith/arXiv arXiv 2026
[9]

Inyoung Cheong, King Xia, KJ Kevin Feng, Quan Ze Chen, and Amy X Zhang
[10]

InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency

(A) I am not a lawyer, but...: engaging legal experts towards responsible LLM policies for legal advice. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 2454–2469

2024
[11]

Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG] https://arxiv.org/abs/2307.08691

Pith/arXiv arXiv 2023
[12]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy...

2019
[13]

Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, and Michael Livermore. 2024. Lawma: The Power of Specialization for Legal Tasks. arXiv:2407.16615 [cs.CL] https://arxiv.org/abs/2407.16615

arXiv 2024
[14]

Aran Komatsuzaki Enrico Shippole. 2024. Cleaned Caselaw Access Project. https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

2024
[15]

Adam Feldman. 2018. Empirical SCOTUS: An opinion is worth at least a thousand words (Corrected). https://www.scotusblog.com/2018/04/empirical-scotus-an- opinion-is-worth-at-least-a-thousand-words/ SCOTUSblog, Accessed: 2025-02- 01

2018
[16]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

Pith/arXiv arXiv 2024
[17]

Krass, Lucia Zheng, Neel Guha, Christopher D

Peter Henderson, Mark S. Krass, Lucia Zheng, Neel Guha, Christopher D. Man- ning, Dan Jurafsky, and Daniel E. Ho. 2022. Pile of Law: Learning Respon- sible Data Filtering from the Law and a 256GB Open-Source Legal Dataset. https://arxiv.org/abs/2207.00220

arXiv 2022
[18]

Harvard Library Innovation Lab. 2024. Cold Cases Dataset. https://huggingface. co/datasets/harvard-lil/cold-cases Accessed: 2025-02-01

2024
[19]

Li Lucy and David Bamman. 2021. Gender and representation bias in GPT-3 gen- erated stories. InProceedings of the Third Workshop on Narrative Understanding. 48–55. https://aclanthology.org/2021.nuse-1.5/

2021
[20]

Robert Mahari, Dominik Stammbach, Elliott Ash, and Alex Pentland. 2023. The Law and NLP: Bridging Disciplinary Disconnects. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3445–

2023
[21]

doi:10.18653/v1/2023.findings-emnlp.224

work page doi:10.18653/v1/2023.findings-emnlp.224 2023
[22]

Robert Mahari, Dominik Stammbach, Elliott Ash, and Alex Pentland. 2024. LePaRD: A Large-Scale Dataset of Judicial Citations to Precedent. InProceed- ings of the 62nd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Ban...

work page doi:10.18653/v1/2024.acl-long.532 2024
[23]

Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel E. Ho. 2024. MultiLegalPile: A 689GB Multilingual Legal Corpus. arXiv:2306.02069 [cs.CL] https://arxiv.org/abs/2306.02069

arXiv 2024
[24]

McCarthy, Christopher Hahn, Brian M

Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, and Christopher Man- ning. 2025. LawInstruct: A Resource for Studying Language Model Adaptation to the Legal Domain. arXiv:2404.02127 [cs.CL] https://arxiv.org/abs/2404.02127

arXiv 2025
[25]

Travisano

Adam Paine and Robert M. Travisano. 2025. Discovery Pitfalls in the Age of AI. https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no-legal- confidentiality-when-using-chatgpt-as-a-therapist/.Epstein Becker Green(Sep- tember 2025)

2025
[26]

Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, and Jonathan Frankle. 2023. MosaicBERT: a bidirectional encoder optimized for fast pretraining. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., ...

2023
[27]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

Pith/arXiv arXiv 2019
[28]

Spaeth, Lee Epstein, Jeffrey A

Harold J. Spaeth, Lee Epstein, Jeffrey A. Segal, Andrew D. Martin, Theodore J. Ruger, and Sara C. Benesh. 2020. Supreme Court Database, Version 2020 Release

2020
[29]

Accessed: 2025-02-01

Washington University Law. Accessed: 2025-02-01

2025
[30]

Dominik Stammbach, Kylie Zhang, Patty Liu, Nimra Nadeem, Inyoung Cheong, Lucia Zheng, and Peter Henderson. 2026. Legal Retrieval for Public Defenders. arXiv:2601.14348 [cs.IR] https://arxiv.org/abs/2601.14348

Pith/arXiv arXiv 2026
[31]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing568 (2024), 127063. doi:10.1016/j.neucom.2023.127063

work page doi:10.1016/j.neucom.2023.127063 2024
[32]

Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W

Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search Index. arXiv:2202.06991 [cs.CL] https://arxiv.org/abs/2202.06991

arXiv 2022
[33]

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hall- ström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Infer...

Pith/arXiv arXiv 2024
[34]

Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, and Markus Leippold
[35]

arXiv:2110.12010 [cs.CL] https://arxiv.org/abs/2110.12010

ClimateBert: A Pretrained Language Model for Climate-Related Text. arXiv:2110.12010 [cs.CL] https://arxiv.org/abs/2110.12010

arXiv
[36]

Anderson, Peter Henderson, and Daniel E

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. arXiv:2104.08671 [cs.CL] https://arxiv.org/abs/ 2104.08671

arXiv 2021
[37]

Manning, Peter Henderson, and Daniel E

Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christo- pher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. A Reasoning-Focused Legal Retrieval Benchmark. InProceedings of the 2025 Symposium on Computer Science and Law(Munich, Germany)(CSLA W ’25). Association for Computing Machinery, New York, NY, USA, 169–193. doi:10.1145/370...

work page doi:10.1145/3709025.3712219 2025
[38]

Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computati...

work page doi:10.18653/v1/2020.acl-main.466 2020
[39]

Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. 2021. A Robustly Optimized BERT Pre-training Approach with Post-training. InProceedings of the 20th Chinese National Conference on Computational Linguistics, Sheng Li, Maosong Sun, Yang Liu, Hua Wu, Kang Liu, Wanxiang Che, Shizhu He, and Gaoqi Rao (Eds.). Chinese Information Processing Society of China, Huhhot,...

2021

[1] [1]

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent anti-muslim bias in large language models. InProceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 298–306. https://dl.acm.org/doi/abs/10.1145/3461702. 3462624

work page doi:10.1145/3461702 2021

[2] [2]

Elliott Ash, Aniket Kesari, Suresh Naidu, Lena Song, and Dominik Stammbach

[3] [3]

InProceedings of the Symposium on Computer Science and Law(Boston, MA, USA)(CSLA W ’24)

Translating Legalese: Enhancing Public Understanding of Court Opinions with Legal Summarizers. InProceedings of the Symposium on Computer Science and Law(Boston, MA, USA)(CSLA W ’24). Association for Computing Machinery, New York, NY, USA, 136–157. doi:10.1145/3614407.3643700

work page doi:10.1145/3614407.3643700

[4] [4]

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. InProceedings of the 2019 Conference on Empirical Meth- ods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association ...

work page doi:10.18653/v1/d19-1371 2019

[5] [5]

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. InFindings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 2898–2904. doi:10.18653/v1/2...

work page doi:10.18653/v1/2020.findings-emnlp.261 2020

[6] [6]

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada,...

2023

[7] [7]

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androut- sopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline...

work page doi:10.18653/v1/2022.acl-long.297 2022

[8] [8]

Inyoung Cheong, Patty Liu, Dominik Stammbach, and Peter Henderson. 2026. How Can AI Augment Access to Justice? Public Defenders’ Perspectives on AI Adoption. arXiv:2510.22933 [cs.CY] https://arxiv.org/abs/2510.22933

Pith/arXiv arXiv 2026

[9] [9]

Inyoung Cheong, King Xia, KJ Kevin Feng, Quan Ze Chen, and Amy X Zhang

[10] [10]

InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency

(A) I am not a lawyer, but...: engaging legal experts towards responsible LLM policies for legal advice. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 2454–2469

2024

[11] [11]

Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG] https://arxiv.org/abs/2307.08691

Pith/arXiv arXiv 2023

[12] [12]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy...

2019

[13] [13]

Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, and Michael Livermore. 2024. Lawma: The Power of Specialization for Legal Tasks. arXiv:2407.16615 [cs.CL] https://arxiv.org/abs/2407.16615

arXiv 2024

[14] [14]

Aran Komatsuzaki Enrico Shippole. 2024. Cleaned Caselaw Access Project. https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

2024

[15] [15]

Adam Feldman. 2018. Empirical SCOTUS: An opinion is worth at least a thousand words (Corrected). https://www.scotusblog.com/2018/04/empirical-scotus-an- opinion-is-worth-at-least-a-thousand-words/ SCOTUSblog, Accessed: 2025-02- 01

2018

[16] [16]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

Pith/arXiv arXiv 2024

[17] [17]

Krass, Lucia Zheng, Neel Guha, Christopher D

Peter Henderson, Mark S. Krass, Lucia Zheng, Neel Guha, Christopher D. Man- ning, Dan Jurafsky, and Daniel E. Ho. 2022. Pile of Law: Learning Respon- sible Data Filtering from the Law and a 256GB Open-Source Legal Dataset. https://arxiv.org/abs/2207.00220

arXiv 2022

[18] [18]

Harvard Library Innovation Lab. 2024. Cold Cases Dataset. https://huggingface. co/datasets/harvard-lil/cold-cases Accessed: 2025-02-01

2024

[19] [19]

Li Lucy and David Bamman. 2021. Gender and representation bias in GPT-3 gen- erated stories. InProceedings of the Third Workshop on Narrative Understanding. 48–55. https://aclanthology.org/2021.nuse-1.5/

2021

[20] [20]

Robert Mahari, Dominik Stammbach, Elliott Ash, and Alex Pentland. 2023. The Law and NLP: Bridging Disciplinary Disconnects. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3445–

2023

[21] [21]

doi:10.18653/v1/2023.findings-emnlp.224

work page doi:10.18653/v1/2023.findings-emnlp.224 2023

[22] [22]

Robert Mahari, Dominik Stammbach, Elliott Ash, and Alex Pentland. 2024. LePaRD: A Large-Scale Dataset of Judicial Citations to Precedent. InProceed- ings of the 62nd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Ban...

work page doi:10.18653/v1/2024.acl-long.532 2024

[23] [23]

Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel E. Ho. 2024. MultiLegalPile: A 689GB Multilingual Legal Corpus. arXiv:2306.02069 [cs.CL] https://arxiv.org/abs/2306.02069

arXiv 2024

[24] [24]

McCarthy, Christopher Hahn, Brian M

Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, and Christopher Man- ning. 2025. LawInstruct: A Resource for Studying Language Model Adaptation to the Legal Domain. arXiv:2404.02127 [cs.CL] https://arxiv.org/abs/2404.02127

arXiv 2025

[25] [25]

Travisano

Adam Paine and Robert M. Travisano. 2025. Discovery Pitfalls in the Age of AI. https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no-legal- confidentiality-when-using-chatgpt-as-a-therapist/.Epstein Becker Green(Sep- tember 2025)

2025

[26] [26]

Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, and Jonathan Frankle. 2023. MosaicBERT: a bidirectional encoder optimized for fast pretraining. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., ...

2023

[27] [27]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

Pith/arXiv arXiv 2019

[28] [28]

Spaeth, Lee Epstein, Jeffrey A

Harold J. Spaeth, Lee Epstein, Jeffrey A. Segal, Andrew D. Martin, Theodore J. Ruger, and Sara C. Benesh. 2020. Supreme Court Database, Version 2020 Release

2020

[29] [29]

Accessed: 2025-02-01

Washington University Law. Accessed: 2025-02-01

2025

[30] [30]

Dominik Stammbach, Kylie Zhang, Patty Liu, Nimra Nadeem, Inyoung Cheong, Lucia Zheng, and Peter Henderson. 2026. Legal Retrieval for Public Defenders. arXiv:2601.14348 [cs.IR] https://arxiv.org/abs/2601.14348

Pith/arXiv arXiv 2026

[31] [31]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing568 (2024), 127063. doi:10.1016/j.neucom.2023.127063

work page doi:10.1016/j.neucom.2023.127063 2024

[32] [32]

Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W

Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search Index. arXiv:2202.06991 [cs.CL] https://arxiv.org/abs/2202.06991

arXiv 2022

[33] [33]

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hall- ström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Infer...

Pith/arXiv arXiv 2024

[34] [34]

Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, and Markus Leippold

[35] [35]

arXiv:2110.12010 [cs.CL] https://arxiv.org/abs/2110.12010

ClimateBert: A Pretrained Language Model for Climate-Related Text. arXiv:2110.12010 [cs.CL] https://arxiv.org/abs/2110.12010

arXiv

[36] [36]

Anderson, Peter Henderson, and Daniel E

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. arXiv:2104.08671 [cs.CL] https://arxiv.org/abs/ 2104.08671

arXiv 2021

[37] [37]

Manning, Peter Henderson, and Daniel E

Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christo- pher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. A Reasoning-Focused Legal Retrieval Benchmark. InProceedings of the 2025 Symposium on Computer Science and Law(Munich, Germany)(CSLA W ’25). Association for Computing Machinery, New York, NY, USA, 169–193. doi:10.1145/370...

work page doi:10.1145/3709025.3712219 2025

[38] [38]

Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computati...

work page doi:10.18653/v1/2020.acl-main.466 2020

[39] [39]

Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. 2021. A Robustly Optimized BERT Pre-training Approach with Post-training. InProceedings of the 20th Chinese National Conference on Computational Linguistics, Sheng Li, Maosong Sun, Yang Liu, Hua Wu, Kang Liu, Wanxiang Che, Shizhu He, and Gaoqi Rao (Eds.). Chinese Information Processing Society of China, Huhhot,...

2021