arXiv preprint arXiv:2103.06268
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
citing papers explorer
- LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
  LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.
- Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free
  Retrieval with frozen embeddings and k-NN delivers competitive accuracy, high data efficiency, and zero hallucinations on legal multi-label annotation across ECtHR and Eurlex datasets.
- A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction
  Domain-trained small language model Olava Extract outperforms frontier LLMs on structured contract extraction with macro F1 0.812, micro F1 0.842, highest precision, and 78-97% lower inference cost.
- A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness
  The paper releases a benchmark of ten life-insurance contracts, a domain ontology, and 58 evidence-linked scenarios that shows ontology-driven knowledge graph queries produce more consistent and diagnosable gap/overlap results than text-only LLM inference.