In Proceedings of NAACL-HLT, pages 107–112

URLhttps:// arxiv · 2023 · arXiv 2408.07983

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

cs.CL · 2026-05-28 · unverdicted · novelty 8.0

Multi-Legal-Bench creates a sparse 5x6 task-jurisdiction matrix across six countries and reports that few-shot effects replicate, no model dominates, cross-lingual transfer tracks label alignment more than language family, and tokenizer fertility does not predict accuracy.

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

QIMMA produces a validated multi-domain Arabic LLM benchmark of 52k samples by systematically detecting and correcting quality issues in prior resources via LLM-assisted and human review.

citing papers explorer


Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions cs.CL · 2026-05-28 · unverdicted · none · ref 9
Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation cs.CL · 2026-04-03 · unverdicted · none · ref 6
