MedArabiQ: Benchmarking large language models on Arabic medical tasks
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation
  QIMMA produces a validated multi-domain Arabic LLM benchmark of 52k samples by systematically detecting and correcting quality issues in prior resources via LLM-assisted and human review.
- MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction
  MedGuards introduces a multi-agent in-context learning framework for medical error detection and correction, plus the KPCS metric, and reports improvements on four multilingual clinical note datasets.
- CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
  CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.