In Proceedings of NAACL-HLT, pages 107–112
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions
  Multi-Legal-Bench creates a sparse 5x6 task-jurisdiction matrix across six countries and reports that few-shot effects replicate, no model dominates, cross-lingual transfer tracks label alignment more than language family, and tokenizer fertility does not predict accuracy.
- Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation
  QIMMA produces a validated multi-domain Arabic LLM benchmark of 52k samples by systematically detecting and correcting quality issues in prior resources via LLM-assisted and human review.