Legal Retrieval for Public Defenders

Dominik Stammbach; Inyoung Cheong; Kylie Zhang; Lucia Zheng; Nimra Nadeem; Patty Liu; Peter Henderson

arxiv: 2601.14348 · v3 · pith:NRPEBZGRnew · submitted 2026-01-20 · 💻 cs.IR

Dominik Stammbach, Kylie Zhang, Patty Liu, Nimra Nadeem, Inyoung Cheong, Lucia Zheng, Peter Henderson

Pith reviewed 2026-05-21 15:02 UTC · model grok-4.3

classification 💻 cs.IR

keywords: legal retrieval, public defenders, information retrieval, query expansion, legal AI, domain adaptation, benchmarks, appellate briefs

The pith

Domain knowledge adaptations improve retrieval quality for public defense legal research where standard benchmarks fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the NJ BriefBank, a tool that retrieves relevant appellate briefs for public defenders to ease their heavy research workload. It finds that results on off-the-shelf retrieval benchmarks do not transfer to this real-world task. Adding domain knowledge raises the quality of results: expanding queries with legal reasoning, using domain-specific data, and adding curated synthetic examples. The authors release a taxonomy of realistic defender queries along with a manually annotated evaluation set that correlates highly with a proprietary dataset annotated by experienced public defenders. This provides a practical example of tailoring AI retrieval to support the constitutional right to counsel under resource constraints.

Core claim

We show that existing retrieval benchmarks fail to transfer to real public defense research, however adding domain knowledge improves retrieval quality. This includes query expansion with legal reasoning, domain-specific data and curated synthetic examples. To facilitate further research, we release a taxonomy of realistic defender search queries and a manually annotated evaluation dataset for public defense retrieval. This benchmark is highly correlated with a proprietary retrieval dataset annotated by experienced public defenders.

What carries the argument

NJ BriefBank, a domain-adapted retrieval system using query expansion with legal reasoning and synthetic data curation to surface appellate briefs.
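
To make "query expansion with legal reasoning" concrete, here is a minimal sketch of one way an expanded query could be assembled. It assumes the reasoning signals (issues, rules, optional analysis) are produced by an upstream step such as a language model; the `LegalReasoning` container, the `expand_query` function, and the output layout are illustrative names, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class LegalReasoning:
    """Hypothetical container for reasoning signals produced upstream."""
    issues: list[str]
    rules: list[str]
    analysis: str = ""  # optional free-text reasoning


def expand_query(query: str, reasoning: LegalReasoning) -> str:
    """Return the original query followed by any available reasoning signals."""
    parts = [query.strip()]
    if reasoning.issues:
        parts.append("Issues: " + "; ".join(reasoning.issues))
    if reasoning.rules:
        parts.append("Rules: " + "; ".join(reasoning.rules))
    if reasoning.analysis:
        parts.append("Analysis: " + reasoning.analysis.strip())
    return "\n".join(parts)


# Placeholder strings stand in for text a reasoning step would generate.
signals = LegalReasoning(issues=["<inferred issue>"], rules=["<applicable rule>"])
expanded = expand_query("merger of offenses", signals)
```

The expanded string would then be sent to the retriever in place of the raw query. Keeping the original query as the first line means the expansion can only add signal, never replace what the defender typed.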

If this is right

Retrieval systems for legal professionals require domain-specific tuning rather than relying on general benchmarks.
Public defenders can access more relevant precedents faster, potentially improving case preparation efficiency.
Releasing annotated datasets and query taxonomies allows the community to build better legal AI tools.
AI assistance in public defense can help address constitutional rights to counsel amid high caseloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar adaptations could benefit retrieval tasks in other specialized legal fields like criminal appeals or civil rights.
Live deployment and user studies with defenders would test if improved retrieval leads to actual time savings or better outcomes.
The method of combining legal reasoning in queries might extend to other knowledge-intensive professional search domains.
Integration with writing assistance tools could further reduce the burden on under-resourced public defense offices.

Load-bearing premise

The manually annotated evaluation dataset and its correlation with defender annotations fully represent the day-to-day search needs of public defenders.
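
One way to probe this premise is to score the same set of retrieval systems on both the released benchmark and the defender-annotated set, then check whether the two rankings agree. The sketch below computes a Spearman rank correlation in plain Python. The choice of Spearman is an assumption (the abstract does not say which correlation statistic is used), and the score lists are made-up placeholders, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder scores for four hypothetical systems on the two datasets.
public_benchmark = [0.10, 0.40, 0.30, 0.90]
defender_annotated = [0.20, 0.50, 0.45, 0.70]
rho = spearman(public_benchmark, defender_annotated)
```

A value of `rho` near 1 would mean the two datasets order systems the same way. It would not by itself show that either dataset covers the full range of day-to-day defender searches.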

What would settle it

Collecting a fresh set of queries and relevance judgments from active public defenders and finding that the domain knowledge methods show no improvement over baselines would falsify the central claim.

Original abstract

AI tools are suggested as solutions to assist public agencies with heavy workloads. In public defense -- where a constitutional right to counsel meets the complexities of law, overwhelming caseloads, and constrained resources -- practitioners face especially taxing conditions. Yet, there is little evidence of how AI could meaningfully support defenders' day-to-day work. In partnership with the New Jersey Office of the Public Defender, we develop the NJ BriefBank, a retrieval tool which surfaces relevant appellate briefs to streamline legal research and writing. We show that existing retrieval benchmarks fail to transfer to real public defense research, however adding domain knowledge improves retrieval quality. This includes query expansion with legal reasoning, domain-specific data and curated synthetic examples. To facilitate further research, we release a taxonomy of realistic defender search queries and a manually annotated evaluation dataset for public defense retrieval. This benchmark is highly correlated with a proprietary retrieval dataset annotated by experienced public defenders. Our work improves on the status quo of realistic legal retrieval benchmarking and illustrates one approach to applying AI in a real-world public interest setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a practical retrieval benchmark and tool for public defenders with a data release, but the quantitative evidence for domain-knowledge gains is still thin.

The main point is that this work creates a retrieval system for New Jersey public defenders and releases a query taxonomy plus an annotated benchmark to show where standard methods fall short and where domain tweaks help. The authors built NJ BriefBank in partnership with the state office to pull relevant appellate briefs, and they report that results on off-the-shelf benchmarks do not transfer well to defender research tasks. Adding legal reasoning for query expansion, domain-specific data, and synthetic examples improves results. The released taxonomy and manually annotated set correlate well with annotations from practicing defenders, which is a reasonable validation step given access limits.

The practical partnership and open resources stand out as useful steps for applied work in a high-stakes, low-resource setting. It is good to see attention on constitutional services like public defense rather than purely academic benchmarks.

The soft spots are in the evaluation details. The abstract claims improvements and high correlation but gives no numbers on effect size, no baseline scores, and no ablation results, so the strength of the domain-knowledge argument is hard to judge from what is shown. It is also worth checking whether the taxonomy and annotations capture deeper workflow elements like precedent chaining or statute mapping; if the set mainly reflects surface features, both the negative transfer finding and the positive gains could be specific to this particular data rather than general.

This is for people in legal IR or domain adaptation who want examples of professional search benchmarks. A reader focused on applied retrieval in constrained environments would find the resources and approach worth looking at. It has enough grounding in a real deployment and enough open material that it deserves a serious referee to assess the full methods and results sections. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper introduces NJ BriefBank, a retrieval system developed with the New Jersey Office of the Public Defender to surface relevant appellate briefs for legal research. It claims that standard IR benchmarks fail to transfer to public-defense workflows, while domain-specific enhancements—query expansion incorporating legal reasoning, domain data, and curated synthetic examples—yield measurable improvements. To support further work, the authors release a taxonomy of realistic defender queries and a manually annotated evaluation dataset that shows high correlation with a proprietary set annotated by experienced public defenders.

Significance. If the evaluation holds, the work offers a concrete demonstration of how IR techniques can be adapted to a high-stakes, resource-constrained public-interest domain and supplies reusable resources (taxonomy and dataset) that could improve benchmarking practices in legal retrieval. The practitioner partnership is a notable strength.

major comments (2)

[Evaluation / Benchmark Construction] The central claims—that existing benchmarks fail to transfer and that domain-knowledge additions improve retrieval—rest on the representativeness of the released taxonomy and manually annotated dataset. The manuscript supports this primarily through reported high correlation with the proprietary defender-annotated set, but does not detail the annotation guidelines, inter-annotator agreement, or coverage of deeper workflow elements such as statute-to-fact mapping and multi-document precedent chaining. Without these, both the negative transfer result and the positive gains risk being artifacts of the evaluation set.
[Results] The abstract and results sections report correlation with expert annotations and overall improvement from domain knowledge, yet provide no quantitative metrics (e.g., nDCG, MAP, or recall@K values), baseline details, or ablation results isolating the contribution of legal-reasoning expansion versus domain data versus synthetic examples. This absence prevents assessment of whether the reported gains are substantial or merely incremental.

minor comments (2)

[Methods] Clarify the exact composition of the synthetic examples and how they were curated to avoid circularity with the evaluation queries.
[Discussion] Add a limitations paragraph discussing potential mismatches between the query taxonomy and the full range of daily defender search tasks.
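
For readers unfamiliar with the ranking metrics the referee names, the following sketch gives standard textbook definitions of recall@K, average precision (MAP is its mean over queries), and nDCG@K. The function names and the toy brief identifiers are illustrative; nothing here reproduces the paper's evaluation code or numbers.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked, gains, k):
    """DCG of the top k results divided by the DCG of an ideal ordering."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: four retrieved briefs, two of which are relevant.
ranked = ["brief_1", "brief_2", "brief_3", "brief_4"]
relevant = {"brief_1", "brief_3"}
r2 = recall_at_k(ranked, relevant, k=2)    # 0.5: one of two relevant briefs in the top 2
ap = average_precision(ranked, relevant)   # (1/1 + 2/3) / 2
ndcg = ndcg_at_k(ranked, {"brief_1": 1, "brief_3": 1}, k=4)
```

Reporting these per system, alongside baselines, is what would let a reader judge whether the domain-knowledge gains are substantial.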

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the clarity and rigor of the evaluation and results sections.

Referee: [Evaluation / Benchmark Construction] The central claims—that existing benchmarks fail to transfer and that domain-knowledge additions improve retrieval—rest on the representativeness of the released taxonomy and manually annotated dataset. The manuscript supports this primarily through reported high correlation with the proprietary defender-annotated set, but does not detail the annotation guidelines, inter-annotator agreement, or coverage of deeper workflow elements such as statute-to-fact mapping and multi-document precedent chaining. Without these, both the negative transfer result and the positive gains risk being artifacts of the evaluation set.

Authors: We agree that greater transparency regarding the annotation process and taxonomy coverage is needed to substantiate the claims. In the revised manuscript we will add a dedicated subsection that describes the annotation guidelines, reports inter-annotator agreement statistics, and explicitly discusses how the taxonomy and dataset address key workflow elements including statute-to-fact mapping and multi-document precedent chaining. These additions will help demonstrate that the evaluation resources are representative of public-defense practice.
Revision: yes

Referee: [Results] The abstract and results sections report correlation with expert annotations and overall improvement from domain knowledge, yet provide no quantitative metrics (e.g., nDCG, MAP, or recall@K values), baseline details, or ablation results isolating the contribution of legal-reasoning expansion versus domain data versus synthetic examples. This absence prevents assessment of whether the reported gains are substantial or merely incremental.

Authors: We acknowledge that the current results presentation would benefit from more granular quantitative reporting. We will revise the results section to include specific metrics such as nDCG, MAP, and recall@K, provide full baseline descriptions, and add ablation experiments that isolate the individual contributions of legal-reasoning query expansion, domain-specific data, and synthetic examples. This will enable readers to assess the practical significance of the observed improvements.
Revision: yes
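
The ablation discussed in this exchange amounts to toggling each of the three components on and off and scoring every combination. A minimal sketch of that grid follows. The component names are taken from the text above, while `evaluate` is a hypothetical callable standing in for a full retrieval run; this is not the authors' experimental setup.

```python
from itertools import product

COMPONENTS = ("query_expansion", "domain_data", "synthetic_examples")


def ablation_configs():
    """Yield every on/off combination of the three components (2**3 = 8)."""
    for flags in product((False, True), repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, flags))


def run_ablation(evaluate):
    """Score each configuration; `evaluate` maps a config dict to a metric value."""
    return [(config, evaluate(config)) for config in ablation_configs()]


def count_enabled(config):
    """Dummy evaluator for illustration: counts enabled components."""
    return sum(config.values())


results = run_ablation(count_enabled)
```

Comparing a configuration against the one that differs only in a single flag isolates that component's contribution. The all-off configuration serves as the baseline.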

Circularity Check

0 steps flagged

No circularity: empirical claims rest on released external annotations and correlations rather than self-referential fitting or derivation.

full rationale

The paper reports empirical retrieval experiments on a new public-defense benchmark, shows failure of prior benchmarks to transfer, and demonstrates gains from domain-specific query expansion, data, and synthetic examples. It releases a taxonomy and manually annotated dataset whose correlation with a separate proprietary defender-annotated set is offered as external validation. No equations, predictions, or first-principles derivations are present that reduce to the paper's own inputs by construction. The load-bearing assumption (representativeness of the new benchmark) is acknowledged as an empirical limitation rather than a definitional or self-citation loop. This is a standard non-circular empirical IR paper whose central results are falsifiable against the released data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that defender-provided annotations and synthetic examples capture the relevant distribution of real queries; no new physical or mathematical entities are introduced.

axioms (1)

Domain assumption: Standard information retrieval metrics (e.g., precision, recall, or ranking quality) are appropriate proxies for usefulness in legal research.
Invoked when claiming that domain knowledge improves retrieval quality.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge
cs.IR 2026-04 unverdicted novelty 7.0

CAR is a new retrieval objective that targets the currently active authority set rather than most-similar documents, with theorems on coverage conditions and evaluations showing two-stage methods outperform dense retr...


