Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free
Pith reviewed 2026-05-19 21:36 UTC · model grok-4.3
The pith
Retrieval in a frozen embedding space assigns multiple legal labels to documents with competitive accuracy, strong data efficiency, and no risk of hallucinating outside the taxonomy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that embedding documents and candidate label texts with a frozen retrieval model and predicting via k-nearest neighbors produces accurate multi-label annotations on ECtHR-A, ECtHR-B, and Eurlex. On Eurlex the approach reaches a Macro-F1 of 49.12 versus 40.41 for zero-shot GPT-5.2; with only 100 training samples it nearly doubles the Micro-F1 of hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). Because labels are drawn exclusively from the indexed set, the method never produces labels outside the taxonomy.
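The claim quotes both Macro-F1 and Micro-F1, and the two can diverge sharply when some labels are rare. The sketch below is a minimal illustration on made-up data, not the paper's evaluation code; the label names `frequent` and `rare` and the helper `micro_macro_f1` are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, labels):
    """gold and pred are lists of label sets, one set per document."""
    counts = {lab: [0, 0, 0] for lab in labels}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in p and lab in g:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts first, so frequent labels dominate.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    return micro, macro

# Toy data (hypothetical): the predictor never emits the rare label.
labels = ["frequent", "rare"]
gold = [{"frequent"}, {"frequent"}, {"frequent"}, {"rare"}]
pred = [{"frequent"}, {"frequent"}, {"frequent"}, {"frequent"}]
micro, macro = micro_macro_f1(gold, pred, labels)
```

Here the pooled Micro-F1 is 0.75 while Macro-F1 is 3/7, because the missed rare label counts as a full zero in the macro average. This is why a Macro-F1 gain on a 100-label taxonomy says something about rare labels that a Micro-F1 gain alone would not.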
What carries the argument
k-nearest-neighbor lookup between document embeddings and label-description embeddings produced by a frozen retrieval model.
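A minimal sketch of that lookup, under stated assumptions: the three-dimensional vectors stand in for outputs of the frozen embedder (which is not shown), the label names are placeholders, and cosine similarity is assumed as the distance. How the paper turns retrieved neighbors into a final label set is not specified in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_labels(doc_vec, label_index, k):
    """Return the k labels whose description embeddings are closest to doc_vec."""
    ranked = sorted(
        label_index,
        key=lambda name: cosine(doc_vec, label_index[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical label-description embeddings (toy values, not real model output).
label_index = {
    "Article 3": [1.0, 0.0, 0.0],
    "Article 6": [0.0, 1.0, 0.0],
    "Article 8": [0.0, 0.0, 1.0],
}
doc_vec = [0.9, 0.4, 0.1]  # hypothetical document embedding
predicted = top_k_labels(doc_vec, label_index, k=2)
```

Every returned name is a key of `label_index`, so the prediction cannot leave the indexed label set regardless of the input vector.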
If this is right
- New labels can be added by embedding their descriptions and updating the index with no gradient-based retraining required.
- Every prediction is drawn only from the supplied taxonomy, eliminating hallucination outside the defined label set.
- Competitive accuracy is achieved with as few as 100 labeled training documents on complex legal tasks.
- Estimated compute drops by a factor of 20-30 relative to fine-tuning large generative models.
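The first two consequences can be illustrated with a toy in-memory index. This is a sketch under assumptions, not the authors' system: the class `LabelIndex`, its methods, the label names, and the two-dimensional vectors are all hypothetical, and the dot product is used as similarity on the assumption that vectors are unit-normalised.

```python
class LabelIndex:
    """Toy in-memory index mapping label names to description embeddings."""

    def __init__(self):
        self._vectors = {}

    def add_label(self, name, vector):
        # Extending the taxonomy is a dictionary write; no model weights change.
        self._vectors[name] = list(vector)

    def predict(self, doc_vec, k):
        # Dot-product similarity; assumes unit-normalised vectors.
        def score(name):
            return sum(a * b for a, b in zip(doc_vec, self._vectors[name]))
        return sorted(self._vectors, key=score, reverse=True)[:k]

    def labels(self):
        return set(self._vectors)

index = LabelIndex()
index.add_label("environment", [1.0, 0.0])
index.add_label("taxation", [0.0, 1.0])
before = index.predict([0.6, 0.8], k=1)

# The taxonomy grows: embed the new description (vector made up here) and index it.
index.add_label("data protection", [0.6, 0.8])
after = index.predict([0.6, 0.8], k=1)
```

The same document is labelled `taxation` before the update and `data protection` after it, with no training step in between, and both predictions are members of `index.labels()`.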
Where Pith is reading between the lines
- The same frozen-retrieval pattern could be applied directly to other domains that maintain large, frequently updated label sets such as medical coding or regulatory compliance.
- A hybrid pipeline could use retrieval to select the label set and then invoke a small generative model only to produce human-readable justifications.
- Gains would likely widen as stronger general-purpose retrieval models become available without any legal-domain fine-tuning.
Load-bearing premise
Similarity in the frozen retrieval embedding space reliably indicates whether a label applies to a long, fact-intensive legal document without any task-specific adaptation of the embedder.
What would settle it
A test set of legal documents whose correct labels depend on subtle statutory distinctions that general semantic embeddings do not capture, causing retrieval F1 to fall below fine-tuned baselines despite ample data.
Original abstract
Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only N=100 training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
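The abstract reports hallucination as a share of test samples containing at least one label outside the taxonomy. A per-sample rate of that kind could be computed roughly as follows; this is an illustrative sketch on made-up predictions, and the function name `out_of_taxonomy_rate` and the toy labels are assumptions rather than anything taken from the paper.

```python
def out_of_taxonomy_rate(predictions, taxonomy):
    """Fraction of samples with at least one predicted label outside the taxonomy."""
    allowed = set(taxonomy)
    flagged = sum(1 for labels in predictions if set(labels) - allowed)
    return flagged / len(predictions)

taxonomy = {"A", "B", "C"}
# Hypothetical outputs: free-text generation can emit an unknown label such as "D".
generated = [["A"], ["B", "D"], ["C"], ["A", "B"]]
# Hypothetical retrieval outputs: every label comes from the index.
retrieved = [["A"], ["B", "C"], ["C"], ["A", "B"]]

gen_rate = out_of_taxonomy_rate(generated, taxonomy)
ret_rate = out_of_taxonomy_rate(retrieved, taxonomy)
```

In this toy case one of four generated samples is flagged, while the retrieval rate is zero by construction because its outputs are drawn from the indexed labels only.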
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes casting multi-label legal annotation as a kNN retrieval task in the embedding space of a frozen retrieval model applied to documents and label descriptions. This enables extensible updates via re-indexing rather than retraining. Experiments on ECtHR-A, ECtHR-B, and Eurlex (100 labels) report competitive Macro- and Micro-F1 scores, strong data efficiency (e.g., N=100 samples nearly doubling Micro-F1 over hierarchical Legal-BERT on ECtHR-A), 20-30x compute savings versus fine-tuning, and zero hallucinations by construction versus GPT-5.2 zero-shot prompting.
Significance. If the central results hold, the work demonstrates a practical, hallucination-free alternative for high-cardinality and evolving legal taxonomies that avoids gradient updates and scales with limited supervision. The quantified data-efficiency gains and explicit contrast to generative hallucination rates provide a clear deployability argument for resource-constrained legal annotation pipelines.
major comments (3)
- [Approach / Methods] The central claim that frozen-embedding kNN reliably identifies label applicability for long, fact-intensive legal documents rests on an untested assumption; no ablation compares the frozen embedder against task-adapted or fine-tuned variants, which is load-bearing for the assertion that retrieval is sufficient without domain-specific adaptation.
- [Experimental Setup] k is listed as the sole free parameter, yet no concrete value, selection procedure, or sensitivity analysis is reported; this directly affects reproducibility of the claimed F1 improvements (e.g., Eurlex Macro-F1 49.12 and ECtHR-A Micro-F1 48.29 at N=100).
- [Results / Experiments] Table or results section reporting the GPT-5.2 and Legal-BERT baselines lacks dataset splits, exact training details, and error bars; without these it is impossible to verify whether the reported gains (40.41 to 49.12 Macro-F1 on Eurlex; 27.87 to 48.29 Micro-F1 on ECtHR-A) are robust to post-hoc implementation choices.
minor comments (2)
- [Abstract / Experiments] Clarify the exact model referred to as 'GPT-5.2' and whether deterministic decoding was used uniformly across all generative baselines.
- [Approach] Add a brief discussion of how label-description embeddings are constructed for the Eurlex 100-label taxonomy to aid replication.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive comments that will help improve the clarity and reproducibility of our work. We address each major comment below and outline the revisions we plan to make.
- Referee: [Approach / Methods] The central claim that frozen-embedding kNN reliably identifies label applicability for long, fact-intensive legal documents rests on an untested assumption; no ablation compares the frozen embedder against task-adapted or fine-tuned variants, which is load-bearing for the assertion that retrieval is sufficient without domain-specific adaptation.
  Authors: We thank the referee for highlighting this point. The manuscript intentionally focuses on the frozen setting to emphasize data efficiency, extensibility without retraining, and avoidance of hallucinations. The competitive results across datasets (e.g., outperforming GPT-5.2 zero-shot and hierarchical Legal-BERT with limited data) serve as empirical validation that frozen embeddings are sufficient for this task. A comparison to fine-tuned variants would be valuable but would require substantial additional experiments and compute, which we believe is beyond the current scope given the paper's emphasis on the no-adaptation advantage. We will add a paragraph in the discussion section to explicitly address this design decision and its relation to the claims.
  Revision: no
- Referee: [Experimental Setup] k is listed as the sole free parameter, yet no concrete value, selection procedure, or sensitivity analysis is reported; this directly affects reproducibility of the claimed F1 improvements (e.g., Eurlex Macro-F1 49.12 and ECtHR-A Micro-F1 48.29 at N=100).
  Authors: We agree that this information is necessary for reproducibility. In the revised manuscript, we will report the specific value of k used in all experiments, describe how it was selected (e.g., via a small validation set), and include a sensitivity analysis plotting performance across a range of k values to demonstrate robustness.
  Revision: yes
- Referee: [Results / Experiments] Table or results section reporting the GPT-5.2 and Legal-BERT baselines lacks dataset splits, exact training details, and error bars; without these it is impossible to verify whether the reported gains (40.41 to 49.12 Macro-F1 on Eurlex; 27.87 to 48.29 Micro-F1 on ECtHR-A) are robust to post-hoc implementation choices.
  Authors: We acknowledge the need for greater transparency in the baseline implementations. The revised version will include a detailed description of the dataset splits, the exact hyperparameters and training procedures for Legal-BERT, the prompting strategy for GPT-5.2, and error bars computed over multiple random seeds or runs where feasible. This will strengthen the verifiability of the reported improvements.
  Revision: yes
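The error bars promised in the last response could be reported as a mean and sample standard deviation across runs. The sketch below shows one simple way to do that; the five scores are invented for illustration and are not results from the paper, and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation of a metric across runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Made-up Micro-F1 values from five hypothetical seeds.
runs = [0.47, 0.49, 0.48, 0.50, 0.46]
mean, std = summarize(runs)
report = f"{mean:.2f} +/- {std:.2f}"
```

A gap between two systems is easier to trust when it is several times larger than the reported spread; a single-run number gives no way to judge that.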
Circularity Check
No significant circularity; empirical results on external benchmarks
The paper proposes casting multi-label legal annotation as kNN retrieval over frozen document and label-description embeddings, then reports accuracy, data-efficiency, and hallucination metrics from direct experiments on three public datasets (ECtHR-A, ECtHR-B, Eurlex) against fixed external baselines such as GPT-5.2 zero-shot and hierarchical Legal-BERT. No equations, parameters, or uniqueness claims reduce by construction to quantities defined inside the paper; the central performance numbers are measured on held-out test splits and are therefore independent of any internal fit or self-referential definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- k (nearest neighbors)
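Since k is the only free parameter and its value is not given in this summary, one plausible way to fix it is a sweep on held-out validation data. The sketch below assumes each validation document already has its candidate labels ranked by similarity; the data, the helper names, and the choice of Micro-F1 as the selection metric are all assumptions, not the authors' reported procedure.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over lists of gold and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_k(ranked_labels, gold, candidate_ks):
    """Pick the k with the best validation Micro-F1; ties go to the smaller k."""
    scores = {}
    for k in candidate_ks:
        pred = [set(ranking[:k]) for ranking in ranked_labels]
        scores[k] = micro_f1(gold, pred)
    best = max(sorted(scores), key=lambda k: scores[k])
    return best, scores

# Hypothetical validation set: per-document label rankings and gold label sets.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "a", "b"]]
gold = [{"a", "b"}, {"b"}, {"c", "a"}]
best_k, scores = select_k(ranked, gold, candidate_ks=[1, 2, 3])
```

The `scores` dictionary doubles as the raw material for a sensitivity plot: a flat curve around `best_k` would suggest the reported F1 values do not hinge on a finely tuned k.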
axioms (1)
- domain assumption: Embedding-space similarity between document and label description vectors indicates label applicability for legal texts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.