arxiv: 2605.16767 · v1 · pith:W35KERMQnew · submitted 2026-05-16 · 💻 cs.CL

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

Pith reviewed 2026-05-19 21:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords multi-label annotation · legal documents · retrieval models · data efficiency · hallucination prevention · nearest neighbor · large language models

The pith

Retrieval in a frozen embedding space assigns multiple legal labels to documents with competitive accuracy, strong data efficiency, and no risk of hallucinating outside the taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes multi-label legal annotation as retrieval: both long documents and label descriptions are embedded by a fixed model, after which labels are assigned by nearest-neighbor lookup in the shared space. This design removes the need to retrain when taxonomies change and guarantees that every prediction stays inside the supplied label set. Experiments across three legal datasets show the method matching or beating zero-shot generative models and fine-tuned encoders while using far less compute and far fewer training examples.

Core claim

The authors show that embedding documents and candidate label texts with a frozen retrieval model and predicting via k-nearest neighbors produces accurate multi-label annotations on ECtHR-A, ECtHR-B, and Eurlex. On Eurlex, Qwen-8B retrieval reaches a Macro-F1 of 49.12 versus 40.41 for zero-shot GPT-5.2; with only 100 training samples, it nearly doubles the Micro-F1 of hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). Because labels are drawn exclusively from the indexed set, the method cannot produce labels outside the taxonomy.
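The claim quotes both Macro-F1 and Micro-F1, which can diverge sharply when labels are imbalanced. The sketch below shows how the two averages are commonly computed from binary indicator matrices. It is an illustration, not the paper's evaluation code: the function name `f1_scores`, the toy data, and the convention that a label with no true or predicted positives scores 0 are all assumptions.

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for indicator matrices of shape (n_docs, n_labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Per-label confusion counts.
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when a label has no positives at all.
        return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))  # pool counts, then score
    macro = float(f1(tp, fp, fn).mean())             # score per label, then average
    return micro, macro

# Toy data: label 0 is frequent and predicted perfectly, label 1 is rare and missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the missed rare label halves Macro-F1 while Micro-F1 stays high, which is why the two numbers are reported separately.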

What carries the argument

k-nearest-neighbor lookup between document embeddings and label-description embeddings produced by a frozen retrieval model.
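A minimal sketch of this lookup, assuming cosine similarity and a top-k cut over label-description embeddings. The source does not state the similarity measure or how neighbors are turned into a label set, so both are assumptions here. The vectors are toy stand-ins for the output of a frozen embedding model, and the label names and the function `knn_labels` are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def knn_labels(doc_emb, label_emb, label_names, k):
    """Return, for each document, the k label names closest in cosine similarity."""
    sims = l2_normalize(doc_emb) @ l2_normalize(label_emb).T  # (n_docs, n_labels)
    top = np.argsort(-sims, axis=1)[:, :k]                    # most similar first
    return [[label_names[j] for j in row] for row in top]

# Toy vectors standing in for frozen-model embeddings (hypothetical values).
label_names = ["Article 3", "Article 6", "Article 8"]
label_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
doc_emb = np.array([[0.9, 0.4, 0.1], [0.1, 0.2, 0.95]])

predictions = knn_labels(doc_emb, label_emb, label_names, k=2)
print(predictions)
```

Because the function can only return entries of `label_names`, every prediction is inside the supplied label set by construction.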

If this is right

  • New labels can be added by embedding their descriptions and updating the index with no gradient-based retraining required.
  • Every prediction is drawn only from the supplied taxonomy, eliminating hallucination outside the defined label set.
  • Competitive accuracy is achieved with as few as 100 labeled training documents on complex legal tasks.
  • Estimated compute drops by a factor of 20-30 relative to fine-tuning.
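The first two consequences can be illustrated with a small in-memory index. This is a sketch under stated assumptions, not the authors' implementation: the class `LabelIndex`, its methods, the label names, and the toy vectors are all hypothetical, and a real system would obtain the vectors from the frozen embedding model.

```python
import numpy as np

class LabelIndex:
    """Hypothetical in-memory index of label-description embeddings."""

    def __init__(self, dim):
        self.names = []
        self.vectors = np.empty((0, dim))

    def add(self, name, vector):
        """Register a label by appending its unit-normalized embedding."""
        v = np.asarray(vector, dtype=float)
        self.names.append(name)
        self.vectors = np.vstack([self.vectors, v / np.linalg.norm(v)])

    def query(self, doc_vector, k):
        """Return up to k indexed labels ranked by cosine similarity."""
        d = np.asarray(doc_vector, dtype=float)
        sims = self.vectors @ (d / np.linalg.norm(d))
        order = np.argsort(-sims)[:k]
        return [self.names[i] for i in order]

index = LabelIndex(dim=3)
index.add("taxation", [1.0, 0.0, 0.0])
index.add("fisheries", [0.0, 1.0, 0.0])

doc = [0.2, 0.3, 0.9]
before = index.query(doc, k=1)

# Extending the taxonomy is a single append; no weights are updated.
index.add("data protection", [0.0, 0.1, 1.0])
after = index.query(doc, k=1)
print(before, after)
```

The query result changes as soon as the new label is indexed, and it can never contain a name that was not added to the index.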

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-retrieval pattern could be applied directly to other domains that maintain large, frequently updated label sets such as medical coding or regulatory compliance.
  • A hybrid pipeline could use retrieval to select the label set and then invoke a small generative model only to produce human-readable justifications.
  • Gains would likely widen as stronger general-purpose retrieval models become available without any legal-domain fine-tuning.

Load-bearing premise

Similarity in the frozen retrieval embedding space reliably indicates whether a label applies to a long, fact-intensive legal document without any task-specific adaptation of the embedder.

What would settle it

A test set of legal documents whose correct labels depend on subtle statutory distinctions that general semantic embeddings do not capture, causing retrieval F1 to fall below fine-tuned baselines despite ample data.

Figures

Figures reproduced from arXiv: 2605.16767 by Jaromir Savelka, Kevin Ashley, Li Zhang.

Figure 1: Comparison of Legal Annotation Paradigms (Inference Time). (a) Parametric Fine-tuning (BERT): Requires updating model weights, data-hungry, rigid. (b) Generative Zero-shot (GPT-5.2): Context window limited, expensive, slow. (c) Proposed Retrieval Model (Qwen-3 Embedding): Retrieval-based, plug-and-play, handles a large and evolving set of labels efficiently.
Figure 3: Performance scaling of Legal-BERT vs. Qwen Retrieval Models across varying training sample sizes.
Original abstract

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only N=100 training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
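The hallucination figure in the abstract is a per-sample rate: the share of test samples with at least one predicted label outside the taxonomy. A minimal sketch of that measurement follows, assuming predictions arrive as lists of label strings. The function name `out_of_taxonomy_rate` and the toy data are hypothetical, and the paper's exact matching rules (for example, case or whitespace normalization) are not given in the source.

```python
def out_of_taxonomy_rate(predictions, taxonomy):
    """Fraction of samples with at least one predicted label not in the taxonomy."""
    allowed = set(taxonomy)
    if not predictions:
        return 0.0
    flagged = sum(
        1 for labels in predictions
        if any(label not in allowed for label in labels)
    )
    return flagged / len(predictions)

# Toy example: the second sample contains a label ("D") that was never offered.
taxonomy = ["A", "B", "C"]
generated = [["A"], ["B", "D"], ["C"], []]
rate = out_of_taxonomy_rate(generated, taxonomy)
print(f"{rate:.2%} of samples contain an out-of-taxonomy label")
```

For a retrieval annotator whose outputs are drawn from the indexed labels, this rate is 0 regardless of the input.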

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes casting multi-label legal annotation as a kNN retrieval task in the embedding space of a frozen retrieval model applied to documents and label descriptions. This enables extensible updates via re-indexing rather than retraining. Experiments on ECtHR-A, ECtHR-B, and Eurlex (100 labels) report competitive Macro- and Micro-F1 scores, strong data efficiency (e.g., N=100 samples nearly doubling Micro-F1 over hierarchical Legal-BERT on ECtHR-A), 20-30x compute savings versus fine-tuning, and zero hallucinations by construction versus GPT-5.2 zero-shot prompting.

Significance. If the central results hold, the work demonstrates a practical, hallucination-free alternative for high-cardinality and evolving legal taxonomies that avoids gradient updates and scales with limited supervision. The quantified data-efficiency gains and explicit contrast to generative hallucination rates provide a clear deployability argument for resource-constrained legal annotation pipelines.

major comments (3)
  1. [Approach / Methods] The central claim that frozen-embedding kNN reliably identifies label applicability for long, fact-intensive legal documents rests on an untested assumption; no ablation compares the frozen embedder against task-adapted or fine-tuned variants, which is load-bearing for the assertion that retrieval is sufficient without domain-specific adaptation.
  2. [Experimental Setup] k is listed as the sole free parameter, yet no concrete value, selection procedure, or sensitivity analysis is reported; this directly affects reproducibility of the claimed F1 improvements (e.g., Eurlex Macro-F1 49.12 and ECtHR-A Micro-F1 48.29 at N=100).
  3. [Results / Experiments] Table or results section reporting the GPT-5.2 and Legal-BERT baselines lacks dataset splits, exact training details, and error bars; without these it is impossible to verify whether the reported gains (40.41 to 49.12 Macro-F1 on Eurlex; 27.87 to 48.29 Micro-F1 on ECtHR-A) are robust to post-hoc implementation choices.
minor comments (2)
  1. [Abstract / Experiments] Clarify the exact model referred to as 'GPT-5.2' and whether deterministic decoding was used uniformly across all generative baselines.
  2. [Approach] Add a brief discussion of how label-description embeddings are constructed for the Eurlex 100-label taxonomy to aid replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive comments that will help improve the clarity and reproducibility of our work. We address each major comment below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [Approach / Methods] The central claim that frozen-embedding kNN reliably identifies label applicability for long, fact-intensive legal documents rests on an untested assumption; no ablation compares the frozen embedder against task-adapted or fine-tuned variants, which is load-bearing for the assertion that retrieval is sufficient without domain-specific adaptation.

    Authors: We thank the referee for highlighting this point. The manuscript intentionally focuses on the frozen setting to emphasize data efficiency, extensibility without retraining, and avoidance of hallucinations. The competitive results across datasets (e.g., outperforming GPT-5.2 zero-shot and hierarchical Legal-BERT with limited data) serve as empirical validation that frozen embeddings are sufficient for this task. A comparison to fine-tuned variants would be valuable but would require substantial additional experiments and compute, which we believe is beyond the current scope given the paper's emphasis on the no-adaptation advantage. We will add a paragraph in the discussion section to explicitly address this design decision and its relation to the claims. revision: no

  2. Referee: [Experimental Setup] k is listed as the sole free parameter, yet no concrete value, selection procedure, or sensitivity analysis is reported; this directly affects reproducibility of the claimed F1 improvements (e.g., Eurlex Macro-F1 49.12 and ECtHR-A Micro-F1 48.29 at N=100).

    Authors: We agree that this information is necessary for reproducibility. In the revised manuscript, we will report the specific value of k used in all experiments, describe how it was selected (e.g., via a small validation set), and include a sensitivity analysis plotting performance across a range of k values to demonstrate robustness. revision: yes

  3. Referee: [Results / Experiments] Table or results section reporting the GPT-5.2 and Legal-BERT baselines lacks dataset splits, exact training details, and error bars; without these it is impossible to verify whether the reported gains (40.41 to 49.12 Macro-F1 on Eurlex; 27.87 to 48.29 Micro-F1 on ECtHR-A) are robust to post-hoc implementation choices.

    Authors: We acknowledge the need for greater transparency in the baseline implementations. The revised version will include a detailed description of the dataset splits, the exact hyperparameters and training procedures for Legal-BERT, the prompting strategy for GPT-5.2, and error bars computed over multiple random seeds or runs where feasible. This will strengthen the verifiability of the reported improvements. revision: yes
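The second response promises a reported value of k, a selection procedure, and a sensitivity analysis. One way such a sweep could look is sketched below, assuming k is chosen by Micro-F1 on a held-out validation set. The similarity matrix and gold labels are toy values, and the helper names `top_k_indicator` and `micro_f1` are hypothetical; the authors' actual procedure is not described in the source.

```python
import numpy as np

def top_k_indicator(sims, k):
    """Binary matrix marking each row's k highest-scoring labels."""
    pred = np.zeros_like(sims, dtype=bool)
    top = np.argsort(-sims, axis=1)[:, :k]
    np.put_along_axis(pred, top, True, axis=1)
    return pred

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (document, label) pairs."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy validation data: document-to-label similarities and gold labels.
sims = np.array([
    [0.9, 0.7, 0.2, 0.1],
    [0.3, 0.8, 0.6, 0.1],
    [0.1, 0.2, 0.9, 0.5],
])
y_true = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=bool)

# Sweep k and keep the value with the best validation Micro-F1.
scores = {k: micro_f1(y_true, top_k_indicator(sims, k)) for k in range(1, 5)}
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: micro-F1={score:.3f}")
```

Reporting the full `scores` table, not only `best_k`, is what lets a reader judge how sensitive the headline numbers are to this choice.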

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

full rationale

The paper proposes casting multi-label legal annotation as kNN retrieval over frozen document and label-description embeddings, then reports accuracy, data-efficiency, and hallucination metrics from direct experiments on three public datasets (ECtHR-A, ECtHR-B, Eurlex) against fixed external baselines such as GPT-5.2 zero-shot and hierarchical Legal-BERT. No equations, parameters, or uniqueness claims reduce by construction to quantities defined inside the paper; the central performance numbers are measured on held-out test splits and are therefore independent of any internal fit or self-referential definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the domain assumption that frozen general-purpose retrieval embeddings capture legal label relevance and on the modeling choice of k in nearest-neighbor lookup; no new entities are postulated.

free parameters (1)
  • k (nearest neighbors)
    The number of neighbors used for label prediction is a modeling choice whose value is not stated in the abstract.
axioms (1)
  • domain assumption: Embedding-space similarity between document and label description vectors indicates label applicability for legal texts.
    This premise underpins the decision to use frozen retrieval embeddings without task-specific training.

pith-pipeline@v0.9.0 · 5815 in / 1336 out tokens · 51058 ms · 2026-05-19T21:36:03.815453+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 8 internal anchors

  1. [1]

    K. D. Ashley, Artificial intelligence and legal analytics: new tools for law practice in the digital age, Cambridge University Press, 2017

  2. [2]

    I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, Lexglue: A benchmark dataset for legal language understanding in english, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4310–4330

  3. [3]

    N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the european court of human rights: A natural language processing perspective, PeerJ computer science 2 (2016) e93

  4. [4]

    I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on eu legislation, in: Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 6314–6322

  5. [5]

    W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. S. Dhillon, Taming pretrained transformers for extreme multi-label text classification, in: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 2020, pp. 3163–3171

  6. [6]

    I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, Multieurlex–a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, arXiv preprint arXiv:2109.00904 (2021)

  7. [7]

    I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Legal-bert: The muppets straight out of law school, arXiv preprint arXiv:2010.02559 (2020)

  8. [8]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901

  9. [9]

    J. Savelka, K. D. Ashley, The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts, Frontiers in Artificial Intelligence 6 (2023) 1279794

  10. [10]

    H. Lee, K. C. Li, M. Grabmair, S. Xu, Efficient prompt optimisation for legal text classification with proxy prompt evaluator, in: Proceedings of the Natural Legal Language Processing Workshop 2025, 2025, pp. 281–290

  11. [11]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, URL https://arxiv.org/abs/2307.03172 (2023)

  12. [12]

    L. Zhang, J. Savelka, K. Ashley, Do llms truly understand when a precedent is overruled?, arXiv preprint arXiv:2510.20941 (2025)

  13. [13]

    T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE transactions on information theory 13 (1967) 21–27

  14. [14]

    X. Chi, W. Zhong, Y. Wu, W. Wang, K. Kuang, F. Wu, M. Xiong, Universal legal article prediction via tight collaboration between supervised classification model and llm, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, 2025, pp. 21–30

  15. [15]

    D. Hendrycks, C. Burns, A. Chen, S. Ball, Cuad: An expert-annotated nlp dataset for legal contract review, arXiv preprint arXiv:2103.06268 (2021)

  16. [16]

    N. Guha, J. Nyarko, D. Ho, C. Ré, A. Chilton, A. Chohlas-Wood, A. Peters, B. Waldon, D. Rockmore, D. Zambrano, et al., Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models, Advances in neural information processing systems 36 (2023) 44123–44279

  17. [17]

    N. Wais, M. Grabmair, Learning from computer vision: The effects of loss functions on legal text classification with class imbalance, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, 2025, pp. 41–50

  18. [18]

    M. Gray, J. Savelka, W. Oliver, K. Ashley, Using llms to discover legal factors, arXiv preprint arXiv:2410.07504 (2024)

  19. [19]

    K. Luo, Q. Huang, C. Jiang, Y. Feng, Automating legal interpretation with llms: Retrieval, generation, and evaluation, arXiv preprint arXiv:2501.01743 (2025)

  20. [20]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837

  21. [21]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, Advances in neural information processing systems 36 (2023) 11809–11822

  22. [22]

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in neural information processing systems 33 (2020) 9459–9474

  23. [23]

    D. Bareham, K. Atkinson, J. Mumford, J. Marshall, Curb your enthusiasm: Towards a rag framework to forecast case importance in the echr, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, 2025, pp. 31–40

  24. [24]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al., Qwen3 technical report, arXiv preprint arXiv:2505.09388 (2025)

  25. [25]

    V. Karpukhin, B. Oguz, S. Min, P. S. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering., in: EMNLP (1), 2020, pp. 6769–6781

  26. [26]

    N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019)

  27. [27]

    T. Gao, X. Yao, D. Chen, Simcse: Simple contrastive learning of sentence embeddings, arXiv preprint arXiv:2104.08821 (2021)

  28. [28]

    J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with gpus, IEEE Transactions on Big Data 7 (2019) 535–547

  29. [29]

    P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, P. Molchanov, Small language models are the future of agentic ai, arXiv preprint arXiv:2506.02153 (2025)

  30. [30]

    M. Gray, L. Zhang, K. D. Ashley, Generating case-based legal arguments with llms, in: Proceedings of the 2025 Symposium on Computer Science and Law, 2025, pp. 160–168

  31. [31]

    L. Zhang, K. D. Ashley, Mitigating manipulation and enhancing persuasion: A reflective multi-agent approach for legal argument generation, arXiv preprint arXiv:2506.02992 (2025)

  32. [32]

    R. Goebel, Y. Kano, M.-Y. Kim, C. Kwan, K. Satoh, H. Yamada, M. Yoshioka, An overview of the COLIEE 2025 competition: Legal case law and statute law information retrieval and entailment, in: Proceedings of the Twentieth International Conference on Artificial Intelligence and Law, ICAIL 2025, ACM, 2025, pp. 506–515. doi:10.1145/3769126.3785016

  33. [33]

    C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al., Cail2018: A large-scale legal dataset for judgment prediction, arXiv preprint arXiv:1807.02478 (2018)

  34. [34]

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

  35. [35]

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., An empirical analysis of compute-optimal large language model training, Advances in neural information processing systems 35 (2022) 30016–30030

  36. [36]

    J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction of medical codes from clinical text, arXiv preprint arXiv:1802.05695 (2018)

  37. [37]

    D. Tuggener, P. Von Däniken, T. Peetz, M. Cieliebak, Ledgar: a large-scale multi-label corpus for text classification of legal provisions in contracts, in: Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 1235–1241.

[Figure residue: panel "ECtHR A: Macro F1"; x-axis: Training Samples (100 to 9000); y-axis: Macro F1; legend begins "BERT-H", "Full-..."]