Magis-Bench is a new benchmark of 74 magistrate-level legal writing tasks from Brazilian exams where the strongest LLMs reach only 6.97/10, showing judicial reasoning remains difficult for current models.
arXiv preprint arXiv:2601.16669
3 Pith papers cite this work.
fields: cs.CL (3)
years: 2026 (3)
representative citing papers
ProHist-Bench shows that even state-of-the-art LLMs struggle with complex historical research questions requiring evidentiary reasoning, based on 400 questions and 10,891 rubrics from the Keju system.
LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.
citing papers explorer
- Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
- Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
- LegalCiteBench: Evaluating Citation Reliability in Legal Language Models