Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

Jaromir Savelka; Kevin Ashley; Li Zhang

arxiv: 2605.16767 · v1 · pith:W35KERMQnew · submitted 2026-05-16 · 💻 cs.CL

Pith reviewed 2026-05-19 21:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords multi-label annotation · legal documents · retrieval models · data efficiency · hallucination prevention · nearest neighbor · large language models

The pith

Retrieval in a frozen embedding space assigns multiple legal labels to documents with competitive accuracy, strong data efficiency, and no risk of hallucinating outside the taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes multi-label legal annotation as retrieval: both long documents and label descriptions are embedded by a fixed model, after which labels are assigned by nearest-neighbor lookup in the shared space. This design removes the need to retrain when taxonomies change and guarantees that every prediction stays inside the supplied label set. Experiments across three legal datasets show the method matching or beating zero-shot generative models and fine-tuned encoders while using far less compute and far fewer training examples.

Core claim

The authors show that embedding documents and candidate label texts with a frozen retrieval model and predicting via k-nearest neighbors produces accurate multi-label annotations on ECtHR-A, ECtHR-B, and Eurlex. On Eurlex the approach reaches Macro-F1 of 49.12 versus 40.41 for zero-shot GPT-5.2; with only 100 training samples it nearly doubles the Micro-F1 of hierarchical Legal-BERT on ECtHR-A. Because labels are drawn exclusively from the indexed set, the method never produces labels outside the taxonomy.

What carries the argument

k-nearest-neighbor lookup between document embeddings and label-description embeddings produced by a frozen retrieval model.
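
As a minimal sketch of the lookup described above, the snippet below ranks label-description vectors by cosine similarity to a document vector and keeps the top k. The three-dimensional vectors, the label names, and the helper `assign_labels` are hypothetical placeholders standing in for the output of a frozen embedding model; the paper's actual model, index, and value of k are not specified here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_labels(doc_vec, label_vecs, k=2):
    """Return the k label names whose vectors are most similar to the document vector."""
    ranked = sorted(label_vecs, key=lambda name: cosine(doc_vec, label_vecs[name]), reverse=True)
    return ranked[:k]

# Hypothetical toy vectors; a real system would obtain these from a frozen embedder.
label_vecs = {
    "Article 3": [1.0, 0.0, 0.0],
    "Article 6": [0.0, 1.0, 0.0],
    "Article 8": [0.0, 0.0, 1.0],
}
docs = {"doc_a": [0.9, 0.4, 0.1], "doc_b": [0.1, 0.2, 0.95]}

predictions = {doc_id: assign_labels(vec, label_vecs, k=2) for doc_id, vec in docs.items()}
print(predictions)
```

Because every returned name is a key of `label_vecs`, a prediction cannot fall outside the indexed label set, which is the property the page refers to as hallucination-free by design.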

If this is right

New labels can be added by embedding their descriptions and updating the index with no gradient-based retraining required.
Every prediction is drawn only from the supplied taxonomy, eliminating hallucination outside the defined label set.
Competitive accuracy is achieved with as few as 100 labeled training documents on complex legal tasks.
Estimated compute drops by a factor of 20-30 relative to fine-tuning large generative models.
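
The first consequence above, extending the taxonomy without retraining, can be illustrated with a small in-memory index. This is only a sketch under stated assumptions: `LabelIndex`, `toy_embed`, the four-word vocabulary, and the label names are all invented for illustration, and the bag-of-words function merely stands in for a frozen retrieval model.

```python
import math

VOCAB = ["torture", "trial", "privacy", "data"]  # hypothetical toy vocabulary

def toy_embed(text):
    """Stand-in for a frozen embedder: counts of vocabulary words in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class LabelIndex:
    """Maps label names to vectors; the embedding function itself is never updated."""

    def __init__(self, embed):
        self.embed = embed
        self.vectors = {}

    def add_label(self, name, description):
        # Adding a label is one embedding call plus one dictionary write.
        self.vectors[name] = self.embed(description)

    def query(self, text, k=1):
        q = self.embed(text)
        ranked = sorted(self.vectors, key=lambda n: _cosine(q, self.vectors[n]), reverse=True)
        return ranked[:k]

index = LabelIndex(toy_embed)
index.add_label("fair_trial", "right to a fair trial")
index.add_label("prohibition_of_torture", "prohibition of torture")

text = "processing of personal data and privacy before trial"
before = index.query(text)

# The taxonomy changes: index one new description, with no gradient step anywhere.
index.add_label("data_protection", "privacy and personal data")
after = index.query(text)
print(before, after)
```

In this toy run the top label switches from `fair_trial` to `data_protection` as soon as the new description is indexed; nothing else in the system changes.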

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frozen-retrieval pattern could be applied directly to other domains that maintain large, frequently updated label sets such as medical coding or regulatory compliance.
A hybrid pipeline could use retrieval to select the label set and then invoke a small generative model only to produce human-readable justifications.
Gains would likely widen as stronger general-purpose retrieval models become available without any legal-domain fine-tuning.

Load-bearing premise

Similarity in the frozen retrieval embedding space reliably indicates whether a label applies to a long, fact-intensive legal document without any task-specific adaptation of the embedder.

What would settle it

A test set of legal documents whose correct labels depend on subtle statutory distinctions that general semantic embeddings do not capture, causing retrieval F1 to fall below fine-tuned baselines despite ample data.

Figures

Figures reproduced from arXiv: 2605.16767 by Jaromir Savelka, Kevin Ashley, Li Zhang.

**Figure 1.** Comparison of Legal Annotation Paradigms (Inference Time). (a) Parametric Fine-tuning (BERT): Requires updating model weights, data-hungry, rigid. (b) Generative Zero-shot (GPT-5.2): Context window limited, expensive, slow. (c) Proposed Retrieval Model (Qwen-3 Embedding): Retrieval-based, plug-and-play, handles a large and evolving set of labels efficiently.

**Figure 3.** Performance scaling of Legal-BERT vs. Qwen Retrieval Models across varying training sample sizes.

Original abstract

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only (N=100) training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
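
The abstract reports the share of test samples in which a generative model emits a label outside the provided taxonomy. One plausible way to compute such a figure is sketched below; the per-sample definition, the function names, and the toy labels are assumptions made for illustration, not the authors' evaluation code.

```python
def out_of_taxonomy(predicted, taxonomy):
    """Labels in one prediction that are not part of the supplied taxonomy."""
    return sorted(set(predicted) - set(taxonomy))

def hallucination_rate(predictions, taxonomy):
    """Fraction of samples containing at least one out-of-taxonomy label."""
    if not predictions:
        return 0.0
    flagged = sum(1 for labels in predictions if out_of_taxonomy(labels, taxonomy))
    return flagged / len(predictions)

# Hypothetical taxonomy and parsed model outputs.
taxonomy = {"agriculture", "trade", "environment"}
generated = [
    ["trade"],
    ["environment", "fisheries"],  # "fisheries" is not in the taxonomy
    ["agriculture", "trade"],
    [],
]

rate = hallucination_rate(generated, taxonomy)
print(rate, out_of_taxonomy(generated[1], taxonomy))
```

A retrieval annotator that only returns entries from its index would score 0.0 on this check by construction, whereas free-text generation has to be validated after the fact.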

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Retrieval with frozen embeddings beats zero-shot GPT-5.2 on these legal datasets for accuracy and efficiency while eliminating hallucinations by design, though the similarity assumption still needs scrutiny.

Retrieval with a frozen embedder and kNN gives higher F1 than zero-shot GPT-5.2 on Eurlex and much better data efficiency than fine-tuned Legal-BERT on ECtHR, all while guaranteeing the output stays inside the taxonomy. The new pieces are the side-by-side numbers on three legal datasets, including the explicit hallucination counts for the generative model and the low-data regime results. The compute savings of 20-30 times versus fine-tuning are also spelled out.

This combination of retrieval for annotation in a high-cardinality legal setting feels like a fresh practical application. It does well at highlighting the updateability benefit: when labels change you just re-embed and re-index instead of retraining. That is a real advantage for evolving legal taxonomies.

The soft spot is the central assumption that similarity in the frozen space reliably signals whether a label applies to the facts in the document. Legal decisions often hinge on subtle distinctions that pure topical similarity might miss. The paper would benefit from some case studies or failure analysis on where the retrieval gets it wrong.

This is for practitioners and researchers who need annotation methods that scale with changing label sets and run on modest resources. People comparing LLM prompting to retrieval methods in domain-specific settings will get concrete numbers to think about. It deserves peer review. The experiments are grounded in public data and the claims are testable.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes casting multi-label legal annotation as a kNN retrieval task in the embedding space of a frozen retrieval model applied to documents and label descriptions. This enables extensible updates via re-indexing rather than retraining. Experiments on ECtHR-A, ECtHR-B, and Eurlex (100 labels) report competitive Macro- and Micro-F1 scores, strong data efficiency (e.g., N=100 samples nearly doubling Micro-F1 over hierarchical Legal-BERT on ECtHR-A), 20-30x compute savings versus fine-tuning, and zero hallucinations by construction versus GPT-5.2 zero-shot prompting.
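
Since the reported results mix Micro-F1 and Macro-F1, a short worked example may help show how the two can diverge on the same predictions. The sketch below uses invented gold and predicted label sets; scoring a label with no true positives as 0 is a convention assumed here, and evaluation libraries may handle that case differently.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, labels):
    """Micro-F1 pools counts over all labels; Macro-F1 averages per-label F1."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn]
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*pooled)
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

# Hypothetical data: a predictor that always outputs the frequent label "A".
labels = ["A", "B", "C"]
gold = [{"A"}, {"A", "B"}, {"C"}, {"A"}]
pred = [{"A"}, {"A"}, {"A"}, {"A"}]

micro, macro = micro_macro_f1(gold, pred, labels)
print(round(micro, 3), round(macro, 3))
```

Here the pooled counts give a Micro-F1 of 2/3, while Macro-F1 drops to about 0.286 because the two rare labels each contribute an F1 of zero. This is why Macro-F1 is the more demanding number on taxonomies with many infrequent labels.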

Significance. If the central results hold, the work demonstrates a practical, hallucination-free alternative for high-cardinality and evolving legal taxonomies that avoids gradient updates and scales with limited supervision. The quantified data-efficiency gains and explicit contrast to generative hallucination rates provide a clear deployability argument for resource-constrained legal annotation pipelines.

major comments (3)

[Approach / Methods] The central claim that frozen-embedding kNN reliably identifies label applicability for long, fact-intensive legal documents rests on an untested assumption; no ablation compares the frozen embedder against task-adapted or fine-tuned variants, which is load-bearing for the assertion that retrieval is sufficient without domain-specific adaptation.
[Experimental Setup] k is listed as the sole free parameter, yet no concrete value, selection procedure, or sensitivity analysis is reported; this directly affects reproducibility of the claimed F1 improvements (e.g., Eurlex Macro-F1 49.12 and ECtHR-A Micro-F1 48.29 at N=100).
[Results / Experiments] Table or results section reporting the GPT-5.2 and Legal-BERT baselines lacks dataset splits, exact training details, and error bars; without these it is impossible to verify whether the reported gains (40.41 to 49.12 Macro-F1 on Eurlex; 27.87 to 48.29 Micro-F1 on ECtHR-A) are robust to post-hoc implementation choices.

minor comments (2)

[Abstract / Experiments] Clarify the exact model referred to as 'GPT-5.2' and whether deterministic decoding was used uniformly across all generative baselines.
[Approach] Add a brief discussion of how label-description embeddings are constructed for the Eurlex 100-label taxonomy to aid replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive comments that will help improve the clarity and reproducibility of our work. We address each major comment below and outline the revisions we plan to make.

Referee: [Approach / Methods] The central claim that frozen-embedding kNN reliably identifies label applicability for long, fact-intensive legal documents rests on an untested assumption; no ablation compares the frozen embedder against task-adapted or fine-tuned variants, which is load-bearing for the assertion that retrieval is sufficient without domain-specific adaptation.

Authors: We thank the referee for highlighting this point. The manuscript intentionally focuses on the frozen setting to emphasize data efficiency, extensibility without retraining, and avoidance of hallucinations. The competitive results across datasets (e.g., outperforming GPT-5.2 zero-shot and hierarchical Legal-BERT with limited data) serve as empirical validation that frozen embeddings are sufficient for this task. A comparison to fine-tuned variants would be valuable but would require substantial additional experiments and compute, which we believe is beyond the current scope given the paper's emphasis on the no-adaptation advantage. We will add a paragraph in the discussion section to explicitly address this design decision and its relation to the claims. revision: no
Referee: [Experimental Setup] k is listed as the sole free parameter, yet no concrete value, selection procedure, or sensitivity analysis is reported; this directly affects reproducibility of the claimed F1 improvements (e.g., Eurlex Macro-F1 49.12 and ECtHR-A Micro-F1 48.29 at N=100).

Authors: We agree that this information is necessary for reproducibility. In the revised manuscript, we will report the specific value of k used in all experiments, describe how it was selected (e.g., via a small validation set), and include a sensitivity analysis plotting performance across a range of k values to demonstrate robustness. revision: yes
Referee: [Results / Experiments] Table or results section reporting the GPT-5.2 and Legal-BERT baselines lacks dataset splits, exact training details, and error bars; without these it is impossible to verify whether the reported gains (40.41 to 49.12 Macro-F1 on Eurlex; 27.87 to 48.29 Micro-F1 on ECtHR-A) are robust to post-hoc implementation choices.

Authors: We acknowledge the need for greater transparency in the baseline implementations. The revised version will include a detailed description of the dataset splits, the exact hyperparameters and training procedures for Legal-BERT, the prompting strategy for GPT-5.2, and error bars computed over multiple random seeds or runs where feasible. This will strengthen the verifiability of the reported improvements. revision: yes
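
The sensitivity analysis promised in the second response could take roughly the following shape: sweep k on a held-out validation set and report the score at each value. This is a sketch, not the authors' procedure; the ranked label lists, gold sets, and candidate k values are invented, and the ranking is assumed to come from similarity search.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over a list of gold and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def sweep_k(ranked, gold, k_values):
    """Validation Micro-F1 when the top k ranked labels are kept, for each candidate k."""
    return {k: micro_f1(gold, [set(r[:k]) for r in ranked]) for k in k_values}

# Hypothetical validation data: labels per document, sorted by decreasing similarity.
ranked = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["C", "D", "A", "B"],
]
gold = [{"A", "B"}, {"B"}, {"C", "D"}]

scores = sweep_k(ranked, gold, k_values=[1, 2, 3])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this toy run the score peaks at k = 2 and falls on either side, the kind of curve a sensitivity plot would make visible. Selecting k on validation data rather than on the test split keeps the reported test numbers free of selection bias.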

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

The paper proposes casting multi-label legal annotation as kNN retrieval over frozen document and label-description embeddings, then reports accuracy, data-efficiency, and hallucination metrics from direct experiments on three public datasets (ECtHR-A, ECtHR-B, Eurlex) against fixed external baselines such as GPT-5.2 zero-shot and hierarchical Legal-BERT. No equations, parameters, or uniqueness claims reduce by construction to quantities defined inside the paper; the central performance numbers are measured on held-out test splits and are therefore independent of any internal fit or self-referential definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the domain assumption that frozen general-purpose retrieval embeddings capture legal label relevance and on the modeling choice of k in nearest-neighbor lookup; no new entities are postulated.

free parameters (1)

k (nearest neighbors)
The number of neighbors used for label prediction is a modeling choice whose value is not stated in the abstract.

axioms (1)

Domain assumption: Embedding-space similarity between document and label description vectors indicates label applicability for legal texts.
This premise underpins the decision to use frozen retrieval embeddings without task-specific training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 8 internal anchors

[1] K. D. Ashley, Artificial intelligence and legal analytics: new tools for law practice in the digital age, Cambridge University Press, 2017
[2] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, Lexglue: A benchmark dataset for legal language understanding in english, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4310–4330
[3] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the european court of human rights: A natural language processing perspective, PeerJ computer science 2 (2016) e93
[4] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on eu legislation, in: Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 6314–6322
[5] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. S. Dhillon, Taming pretrained transformers for extreme multi-label text classification, in: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 3163–3171
[6] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, Multieurlex–a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, arXiv preprint arXiv:2109.00904 (2021)
[7] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Legal-bert: The muppets straight out of law school, arXiv preprint arXiv:2010.02559 (2020)
[8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901
[9] J. Savelka, K. D. Ashley, The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts, Frontiers in Artificial Intelligence 6 (2023) 1279794
[10] H. Lee, K. C. Li, M. Grabmair, S. Xu, Efficient prompt optimisation for legal text classification with proxy prompt evaluator, in: Proceedings of the Natural Legal Language Processing Workshop 2025, 2025, pp. 281–290
[11] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, URL https://arxiv.org/abs/2307.03172 (2023)
[12] L. Zhang, J. Savelka, K. Ashley, Do llms truly understand when a precedent is overruled?, arXiv preprint arXiv:2510.20941 (2025)
[13] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE transactions on information theory 13 (1967) 21–27
[14] X. Chi, W. Zhong, Y. Wu, W. Wang, K. Kuang, F. Wu, M. Xiong, Universal legal article prediction via tight collaboration between supervised classification model and llm, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, 2025, pp. 21–30
[15] D. Hendrycks, C. Burns, A. Chen, S. Ball, Cuad: An expert-annotated nlp dataset for legal contract review, arXiv preprint arXiv:2103.06268 (2021)
[16] N. Guha, J. Nyarko, D. Ho, C. Ré, A. Chilton, A. Chohlas-Wood, A. Peters, B. Waldon, D. Rockmore, D. Zambrano, et al., Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models, Advances in neural information processing systems 36 (2023) 44123–44279
[17] N. Wais, M. Grabmair, Learning from computer vision: The effects of loss functions on legal text classification with class imbalance, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, 2025, pp. 41–50
[18] M. Gray, J. Savelka, W. Oliver, K. Ashley, Using llms to discover legal factors, arXiv preprint arXiv:2410.07504 (2024)
[19] K. Luo, Q. Huang, C. Jiang, Y. Feng, Automating legal interpretation with llms: Retrieval, generation, and evaluation, arXiv preprint arXiv:2501.01743 (2025)
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837
[21] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, Advances in neural information processing systems 36 (2023) 11809–11822
[22] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in neural information processing systems 33 (2020) 9459–9474
[23] D. Bareham, K. Atkinson, J. Mumford, J. Marshall, Curb your enthusiasm: Towards a rag framework to forecast case importance in the echr, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, 2025, pp. 31–40
[24] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al., Qwen3 technical report, arXiv preprint arXiv:2505.09388 (2025)
[25] V. Karpukhin, B. Oguz, S. Min, P. S. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: EMNLP (1), 2020, pp. 6769–6781
[26] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019)
[27] T. Gao, X. Yao, D. Chen, Simcse: Simple contrastive learning of sentence embeddings, arXiv preprint arXiv:2104.08821 (2021)
[28] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with gpus, IEEE Transactions on Big Data 7 (2019) 535–547
[29] P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, P. Molchanov, Small language models are the future of agentic ai, arXiv preprint arXiv:2506.02153 (2025)
[30] M. Gray, L. Zhang, K. D. Ashley, Generating case-based legal arguments with llms, in: Proceedings of the 2025 Symposium on Computer Science and Law, 2025, pp. 160–168
[31] L. Zhang, K. D. Ashley, Mitigating manipulation and enhancing persuasion: A reflective multi-agent approach for legal argument generation, arXiv preprint arXiv:2506.02992 (2025)
[32] R. Goebel, Y. Kano, M.-Y. Kim, C. Kwan, K. Satoh, H. Yamada, M. Yoshioka, An overview of the COLIEE 2025 competition: Legal case law and statute law information retrieval and entailment, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, ICAIL 2025, ACM, 2025, pp. 506–515. doi:10.1145/3769126.3785016
[33] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al., Cail2018: A large-scale legal dataset for judgment prediction, arXiv preprint arXiv:1807.02478 (2018)
[34] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)
[35] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., An empirical analysis of compute-optimal large language model training, Advances in neural information processing systems 35 (2022) 30016–30030
[36] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction of medical codes from clinical text, arXiv preprint arXiv:1802.05695 (2018)
[37] D. Tuggener, P. Von Däniken, T. Peetz, M. Cieliebak, Ledgar: a large-scale multi-label corpus for text classification of legal provisions in contracts, in: Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 1235–1241
