
arxiv: 2603.23750 · v2 · submitted 2026-03-24 · 💻 cs.CL

IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Pith reviewed 2026-05-15 00:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM benchmark · Islamic knowledge · Quran · Hadith · Fiqh · model evaluation · madhab bias

The pith

IslamicMMLU introduces a benchmark of 10,013 questions to test LLMs on Quran, Hadith, and Fiqh.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are consulted for Islamic knowledge, but no comprehensive benchmark existed to measure their accuracy. This paper releases IslamicMMLU with 2,013 Quran questions, 4,000 Hadith questions, and 4,000 Fiqh questions; each track uses several question types to probe different capabilities. An evaluation of 26 models yields average accuracies from 39.8 percent to 93.8 percent. The Quran track shows the widest spread, from 32.4 percent to 99.3 percent, and a new task in the Fiqh track detects preferences for specific Islamic schools of jurisprudence (madhabs). A public leaderboard keeps these comparisons ongoing and transparent. This matters because relying on LLMs for religious topics requires knowing where they succeed and where they fall short.

Core claim

IslamicMMLU is introduced as a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran with 2,013 items, Hadith with 4,000 items, and Fiqh with 4,000 items. Each track contains multiple question types to examine how models handle different facets of Islamic knowledge. When 26 LLMs are tested, accuracy averaged across the three tracks ranges from 39.8 percent to 93.8 percent. The Quran track shows the largest spread, from 32.4 percent to 99.3 percent, and the Fiqh track includes a madhab bias detection task that exposes school-of-thought preferences that vary from model to model.
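
The headline figures are accuracies averaged across three tracks of unequal size, and this summary does not say whether that average is weighted by question count. The sketch below shows both conventions side by side. The track sizes come from the paper; the function names and the per-track accuracies are hypothetical and are not results from the paper.

```python
# Track sizes as reported in the paper.
TRACK_SIZES = {"quran": 2013, "hadith": 4000, "fiqh": 4000}


def macro_average(acc_by_track):
    """Unweighted mean of per-track accuracies (each track counts equally)."""
    return sum(acc_by_track.values()) / len(acc_by_track)


def micro_average(acc_by_track, sizes=TRACK_SIZES):
    """Mean weighted by question count (each question counts equally)."""
    total = sum(sizes[t] for t in acc_by_track)
    return sum(acc_by_track[t] * sizes[t] for t in acc_by_track) / total


# Hypothetical per-track accuracies for one model, NOT taken from the paper.
example = {"quran": 0.50, "hadith": 0.80, "fiqh": 0.80}

total_questions = sum(TRACK_SIZES.values())  # 2013 + 4000 + 4000
macro = macro_average(example)
micro = micro_average(example)
```

Because the Quran track is about half the size of the other two, a model that is weak on Quran items scores lower under the unweighted average than under the question-weighted one, so the choice of convention can shift leaderboard positions.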

What carries the argument

The IslamicMMLU benchmark dataset itself, structured in three tracks with a madhab bias detection task in the Fiqh section.

If this is right

  • LLMs can be ranked on a public leaderboard according to accuracy on Islamic knowledge.
  • Models display measurable preferences for particular madhabs in the Fiqh track.
  • Arabic-specific models show mixed results, and all of them underperform frontier models.
  • The largest accuracy differences among models appear on the Quran track.
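
One way to make "measurable preferences for particular madhabs" concrete is to tally which school's ruling a model selects on questions where the schools differ. This is a minimal sketch under stated assumptions: the summary does not describe the paper's actual scoring, and the label spellings, function name, and sample answers are all hypothetical.

```python
from collections import Counter

# The four major Sunni schools; the label spellings are an assumption.
MADHABS = ("hanafi", "maliki", "shafii", "hanbali")


def madhab_preference(choices):
    """Return the share of answers aligned with each school.

    `choices` holds one label per question: the school whose ruling
    matches the option the model picked.
    """
    n = len(choices)
    if n == 0:
        raise ValueError("no answers to tally")
    counts = Counter(choices)
    return {m: counts.get(m, 0) / n for m in MADHABS}


# Hypothetical answers from one model on eight questions.
sample = ["hanafi", "hanafi", "shafii", "hanafi",
          "maliki", "hanafi", "hanbali", "shafii"]
shares = madhab_preference(sample)
dominant = max(shares, key=shares.get)
```

A model with no preference would sit near 0.25 for each school on a balanced question set; how far the shares depart from that baseline is one possible reading of "bias", though the paper may define it differently.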

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could use the benchmark results to target improvements on question types where models score lowest.
  • Comparable benchmarks for other religious traditions would allow cross-cultural comparison of LLM reliability.
  • The leaderboard could inform choices for users who consult LLMs on Islamic topics.

Load-bearing premise

The questions accurately and representatively capture core Islamic knowledge without introducing selection bias or factual errors during creation and translation.

What would settle it

Independent review by Islamic scholars that identifies factual errors or systematic biases in a substantial fraction of the questions would show the benchmark does not measure what it claims.

Original abstract

Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track is formed of multiple types of questions to examine LLMs capabilities handling different aspects of Islamic knowledge. The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, where their averaged accuracy across the three tracks varied between 39.8% to 93.8% (by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces IslamicMMLU, a benchmark of 10,013 multiple-choice questions across three tracks—Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (4,000 questions)—and uses it to evaluate 26 LLMs, reporting average accuracies ranging from 39.8% to 93.8% (highest for Gemini 3 Flash). It also presents a novel madhab bias detection task within the Fiqh track and releases the evaluation code and public leaderboard.

Significance. If the questions prove factually accurate and representative, the work fills a clear gap in domain-specific LLM evaluation for religious knowledge and provides reusable public artifacts. The madhab bias task adds a distinctive dimension by probing model preferences across Islamic schools of jurisprudence.

major comments (2)
  1. Benchmark construction (abstract and §3): No details are provided on question sourcing from primary canonical texts (e.g., specific Hadith collections such as Sahih Bukhari), translation/validation protocols, or inter-expert agreement metrics. Without these, the central claim that the 10,013 questions form a reliable benchmark cannot be verified, directly affecting all reported accuracy figures.
  2. Madhab bias task (§4.2): The description of how questions were constructed to detect school-of-thought preferences lacks specification of the sampling strategy across madhabs or any expert review process, leaving open the possibility that observed model biases reflect curation artifacts rather than genuine model behavior.
minor comments (2)
  1. Table 1: The track composition table would benefit from an additional column reporting the number of questions per sub-type (e.g., memorization vs. interpretation) to clarify coverage.
  2. §5: The discussion of Arabic-specific models could include a brief comparison of training data overlap with the benchmark sources to contextualize performance differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments identify important gaps in methodological transparency that we agree must be addressed to strengthen the paper's claims about benchmark reliability. We provide point-by-point responses below and will incorporate the requested details in the revised version.

  1. Referee: Benchmark construction (abstract and §3): No details are provided on question sourcing from primary canonical texts (e.g., specific Hadith collections such as Sahih Bukhari), translation/validation protocols, or inter-expert agreement metrics. Without these, the central claim that the 10,013 questions form a reliable benchmark cannot be verified, directly affecting all reported accuracy figures.

    Authors: We acknowledge that the current manuscript provides insufficient detail on these aspects. In the revised version we will expand §3 with a new subsection that specifies the primary sources for each track (e.g., Sahih Bukhari and Sahih Muslim for the Hadith track, standard tafsir works for Quran, and classical fiqh manuals for the Fiqh track), describes the translation and expert validation protocols, and reports inter-expert agreement metrics such as percentage agreement and Cohen’s kappa. These additions will directly support the reliability of the reported accuracy figures.
    revision: yes

  2. Referee: Madhab bias task (§4.2): The description of how questions were constructed to detect school-of-thought preferences lacks specification of the sampling strategy across madhabs or any expert review process, leaving open the possibility that observed model biases reflect curation artifacts rather than genuine model behavior.

    Authors: We agree that the current description is too brief. We will revise §4.2 to explicitly state the sampling strategy (balanced selection of questions with known differing rulings across the four major madhabs, drawn from authoritative fiqh sources) and the expert review process (independent review by two qualified scholars of Islamic jurisprudence to confirm question neutrality and accuracy). This will demonstrate that the detected biases are not artifacts of curation.
    revision: yes
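
For readers unfamiliar with the agreement statistics named in the first response, the sketch below computes percentage agreement and Cohen's kappa for two reviewers. It illustrates the standard definitions only; the label values, function names, and sample judgments are hypothetical, and nothing here reflects data or code from the paper.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Expected agreement if each rater labelled independently
    # according to their own label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical judgments by two reviewers on ten questions.
rater_a = ["ok", "ok", "ok", "flag", "ok", "flag", "ok", "ok", "flag", "ok"]
rater_b = ["ok", "ok", "flag", "flag", "ok", "flag", "ok", "ok", "ok", "ok"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the reviewers agree on 8 of 10 items, but because both label most items "ok", much of that agreement is expected by chance, and kappa comes out well below the raw agreement rate. That gap is why reporting both statistics is more informative than reporting percentage agreement alone.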

Circularity Check

0 steps flagged

No significant circularity in benchmark introduction

The paper's central contribution is the creation and public release of the IslamicMMLU dataset (10,013 questions across Quran, Hadith, and Fiqh tracks) plus an evaluation harness and leaderboard. No mathematical derivations, equations, fitted parameters, or predictive claims exist that could reduce to inputs by construction. Track composition, question types, and madhab bias task are described as curation choices rather than derived results. Self-citations, if present, are not load-bearing for the benchmark's validity, which rests on external expert validation and source attribution (not provided in detail but not claimed as internally derived). The evaluation of 26 LLMs is a straightforward application of the benchmark, not a self-referential prediction. This is a standard dataset paper with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a single domain assumption: that the collected questions faithfully represent Islamic disciplines. Beyond that assumption, no free parameters, invented entities, or formal axioms are introduced.

axioms (1)
  • domain assumption: The 10,013 questions accurately reflect core Islamic knowledge across the three tracks without factual or selection errors.
    Stated implicitly in the construction of the benchmark; no validation statistics are provided in the abstract.

pith-pipeline@v0.9.0 · 5506 in / 1197 out tokens · 38762 ms · 2026-05-15T00:03:54.971051+00:00 · methodology
