Mandy Guo, Zihang Dai, Denny Vrandečić, and Rami Al-Rfou
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
citing papers explorer
- Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free
Retrieval with frozen embeddings and k-NN delivers competitive accuracy, high data efficiency, and zero hallucinations on legal multi-label annotation across ECtHR and Eurlex datasets.
- ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
A new source-grounded QA dataset for U.S. immigration law is built from official documents and used to fine-tune a 3B model, yielding a 27% mean score improvement over the base model on a held-out sample.
- Mimir: Large-scale Multilingual Concept Modeling
Mimir is a 1.6B multilingual concept model pretrained on 38.9 billion sentences across 46 languages and instruction-tuned on 66.8 million sentences across 35 languages, then compared to a token-based LM of similar size.