Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo
6 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: background (1)
polarities: background (1)
citing papers explorer
- Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions
  Multi-Legal-Bench creates a sparse 5x6 task-jurisdiction matrix across six countries and reports that few-shot effects replicate, no model dominates, cross-lingual transfer tracks label alignment more than language family, and tokenizer fertility does not predict accuracy.
- UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
  UA-Legal-Bench is a new five-task benchmark for Ukrainian legal reasoning that demonstrates task-dependent few-shot prompting effects and the need for macro-F1 over accuracy on imbalanced classes.
- LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
  LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.
- Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free
  Retrieval with frozen embeddings and k-NN delivers competitive accuracy, high data efficiency, and zero hallucinations on legal multi-label annotation across ECtHR and Eurlex datasets.
- A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction
  Domain-trained small language model Olava Extract outperforms frontier LLMs on structured contract extraction with macro F1 0.812, micro F1 0.842, highest precision, and 78-97% lower inference cost.
- A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness
  The paper releases a benchmark of ten life-insurance contracts, a domain ontology, and 58 evidence-linked scenarios that shows ontology-driven knowledge graph queries produce more consistent and diagnosable gap/overlap results than text-only LLM inference.