MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Nizar Habash, Farah E Shamout · 2025 · arXiv 2505.03427

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

QIMMA produces a validated multi-domain Arabic LLM benchmark of 52k samples by systematically detecting and correcting quality issues in prior resources via LLM-assisted and human review.

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

cs.CL · 2026-06-24 · unverdicted · novelty 6.0

MedGuards introduces a multi-agent in-context learning framework for medical error detection and correction plus the KPCS metric, reporting improvements on four multilingual clinical note datasets.

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation

cs.CL · 2026-04-03 · unverdicted · none · ref 4

QIMMA produces a validated multi-domain Arabic LLM benchmark of 52k samples by systematically detecting and correcting quality issues in prior resources via LLM-assisted and human review.
