SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
Pith reviewed 2026-05-10 03:08 UTC · model grok-4.3
The pith
A new benchmark shows Arabic fluency in LLMs does not ensure financial reasoning ability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89-9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.
What carries the argument
The SAHM benchmark with its seven specific tasks drawn from authentic Arabic financial and regulatory sources to evaluate LLMs on Shari'ah-compliant reasoning.
Load-bearing premise
The selected seven tasks and expert-verified instances from regulatory sources comprehensively and without bias represent the full scope of Arabic financial and Shari'ah-compliant reasoning needs across regions and dialects.
What would settle it
A new LLM achieving high performance on all tasks, including event-cause reasoning, after only general Arabic training without financial-specific data would challenge the identified performance gaps.
Original abstract
English financial NLP has advanced rapidly through benchmarks targeting earnings analysis, market sentiment, tabular reasoning, and financial question answering, yet Arabic financial NLP remains virtually nonexistent, despite 422 million speakers, $4.9 trillion in Gulf sovereign wealth, and a $4-5 trillion Islamic finance industry requiring specialized Shari'ah compliance over instruments like sukuk, murabaha, and takaful. We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89-9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SAHM, the first Arabic financial and Shari'ah-compliant reasoning benchmark comprising 14,380 expert-verified instances drawn from authentic regulatory, juristic, and corporate sources. It covers seven tasks (AAOIFI standards QA, fatwa-based QA/MCQ, accounting/business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning) and evaluates 20 LLMs, concluding that Arabic fluency does not imply financial reasoning: models reach 91% on recognition tasks but drop sharply on generation, with the widest gaps on event-cause reasoning (1.89-9.84/10). The benchmark and dataset are released publicly.
Significance. If the dataset labels prove reliable, SAHM would fill a clear gap in Arabic financial NLP for a large speaker population and the $4-5 trillion Islamic finance industry. The empirical demonstration of task-specific performance drops and the use of authentic external sources (rather than synthetic data) are strengths; public release of the benchmark supports reproducibility and further work.
major comments (2)
- [Abstract and dataset construction section] The central claim rests on 14,380 'expert-verified' instances, yet no inter-expert agreement scores, expert qualification details, or disagreement-resolution protocol are reported. This is load-bearing for interpretive tasks (fatwa-based QA and AAOIFI standards), where rulings can vary by madhhab; without these metrics, systematic label noise could affect the validity of the reported performance gaps (e.g., 1.89-9.84/10 on event-cause reasoning).
- [Evaluation and results section] The claim that 'Arabic fluency does not imply financial reasoning' is supported by raw performance numbers but lacks statistical tests (e.g., paired significance tests or confidence intervals) for the observed drops from recognition to generation tasks and for the specific event-cause range. This weakens the cross-task and cross-model conclusions.
minor comments (2)
- [Abstract] Instance selection criteria and per-task instance counts are not summarized, making it harder to assess coverage of the claimed seven tasks.
- [Throughout] Arabic terms (e.g., sukuk, murabaha, takaful) would benefit from consistent transliteration and brief English glosses on first use for broader readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which helps strengthen the presentation of our benchmark. We address each major comment below and commit to revisions that improve transparency and rigor without altering the core findings.
read point-by-point responses
Referee: [Abstract and dataset construction section] The central claim rests on 14,380 'expert-verified' instances, yet no inter-expert agreement scores, expert qualification details, or disagreement-resolution protocol are reported. This is load-bearing for interpretive tasks (fatwa-based QA and AAOIFI standards), where rulings can vary by madhhab; without these metrics, systematic label noise could affect the validity of the reported performance gaps (e.g., 1.89-9.84/10 on event-cause reasoning).
Authors: We agree that explicit details on the verification process are essential for establishing benchmark reliability, especially for interpretive tasks. In the revised manuscript, we will expand the dataset construction section to describe the experts' qualifications (e.g., certified Shari'ah scholars with AAOIFI or equivalent credentials and professional accountants), the multi-stage review protocol used for disagreement resolution, and any available agreement metrics from the verification process. We will also add a limitations paragraph addressing potential madhhab-based variations and label noise, noting that sources were selected from authoritative, consensus-oriented regulatory bodies to mitigate this. These additions will better support the reported performance gaps.
Revision: yes
Referee: [Evaluation and results section] The claim that 'Arabic fluency does not imply financial reasoning' is supported by raw performance numbers but lacks statistical tests (e.g., paired significance tests or confidence intervals) for the observed drops from recognition to generation tasks and for the specific event-cause range. This weakens the cross-task and cross-model conclusions.
Authors: We concur that statistical support would strengthen the cross-task and cross-model claims. In the revised evaluation and results section, we will add 95% confidence intervals for all task scores and include paired statistical tests (e.g., Wilcoxon signed-rank tests for recognition vs. generation drops and appropriate post-hoc tests for event-cause reasoning ranges across models). This will provide quantitative evidence for the performance disparities while preserving the original empirical observations.
Revision: yes
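The committed analysis could be sketched as follows. This is a minimal illustration with hypothetical per-model scores (the real values would come from the SAHM leaderboard): a paired Wilcoxon signed-rank test on the recognition-to-generation drop, plus a bootstrap confidence interval for the mean drop.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-model accuracies for 20 models; invented for illustration,
# not the paper's actual numbers.
recognition = rng.uniform(0.70, 0.95, size=20)
generation = recognition - rng.uniform(0.10, 0.40, size=20)

# Paired Wilcoxon signed-rank test on the per-model drop.
stat, p = wilcoxon(recognition, generation)

# 95% bootstrap confidence interval for the mean recognition-generation drop.
drops = recognition - generation
boot_means = rng.choice(drops, size=(10_000, drops.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"p={p:.4g}, mean drop 95% CI=({lo:.3f}, {hi:.3f})")
```

With every model dropping on generation, the signed-rank test is significant by construction here; on real scores the same code would quantify how robust the claimed gap is.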
Circularity Check
No circularity: benchmark compiled from external sources and evaluated directly on LLMs
full rationale
The paper introduces SAHM by collecting 14,380 instances from authentic regulatory, juristic, and corporate sources, followed by expert verification. It then runs standard evaluations of 20 off-the-shelf LLMs on seven tasks and reports raw performance numbers (e.g., recognition vs. generation gaps). No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the derivation chain. The central results are direct measurements against external data and models, not reductions to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert verification by domain specialists ensures the quality, accuracy, and representativeness of the 14,380 benchmark instances.
Reference graph
Works this paper leans on
-
[1]
Gemma 3 technical report. arXiv preprint, abs/2503.19786.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
-
[2]
CFinBench: A comprehensive Chinese financial benchmark for large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL '25, pages 876–891, Albuquerque, New Mexico. Association for Computational Linguistics.
-
[3]
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning. Preprint, arXiv:2506.02515.
-
[4]
Referral flag. Before editing, set IS_MAINLY_REFERRAL: "YES" if the answer mainly redirects to another fatwā, link, or reference and does not provide a substantive independent ruling; "NO" otherwise.
-
[5]
Clean the question. Edit minimally while preserving wording and fiqh intent: remove greetings, honorifics, and personal appeals; remove formal closings; remove the scholar's name if it is only a form of address, keeping it only if the question explicitly seeks that scholar's specific fatwā or opinion; ensure the …
-
[6]
Clean the answer. Edit minimally while preserving wording and reasoning: remove formal openings and closings so the answer starts with substantive content; remove all fatwā numbers, hyperlinks, and navigational phrases, editing surrounding text just enough to remain grammatical; convert Arabic-Indic numerals to Western numerals; remove pur…
-
[7]
Pilot annotation. Two native Arabic financial experts independently annotate a pilot subset of 20 reports, each producing an event–cause question and an analytical answer.
-
[8]
Agreement assessment. We evaluate agreement at two complementary levels: event–cause identification, measured using Cohen's κ, assessing consistency in identifying salient events and their causes; and answer consistency, measured using ROUGE overlap between independently written answers, used as a consistency check rather than a correctness metric.
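Cohen's κ, the agreement statistic named in this protocol, is straightforward to compute from two annotators' label sequences. The sketch below uses invented event-type labels for illustration; it is not the paper's actual annotation data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' categorical labels (chance-corrected agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling with each annotator's marginals.
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[c] * pb.get(c, 0) for c in pa) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labeling the salient event type in 8 reports.
ann1 = ["earnings", "sukuk", "earnings", "merger", "sukuk", "earnings", "merger", "sukuk"]
ann2 = ["earnings", "sukuk", "merger", "merger", "sukuk", "earnings", "merger", "earnings"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.628
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the kind of evidence the referee asks the authors to report.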
-
[9]
Calibration. Annotators review disagreements from the pilot phase, discuss ambiguous cases (e.g., implicit causality, multi-factor events, overlapping economic drivers), and refine shared annotation criteria. This calibration aligns interpretation standards and reduces annotation drift.
-
[10]
Full annotation. After calibration, one expert annotates the remaining reports under the agreed guidelines.
-
[11]
Audit and correction. A senior annotator audits a random sample of completed annotations to verify that each instance identifies a plausible event and its cause(s) supported by the report, includes relevant numerical evidence when available, and provides an analytical explanation rather than a descriptive summary. Annotations that fail these checks are…
-
[12]
Coverage of core ruling (0–4). The candidate must clearly state the same central hukm (e.g., permissibility/prohibition, validity/invalidity) and include the key justification present in the ground truth. One-word/minimal answers without essential justification should receive a much lower score (e.g., 0–1).
-
[13]
Conditions, exceptions, constraints (0–2). Does it retain critical restrictions, qualifiers, or carve-outs that materially affect the ruling?
-
[14]
Doctrinal/factual accuracy (0–2). No misstatements that would change the fatwa; no implicit legalization of prohibited elements (e.g., ribā); no misleading generalizations or invented requirements.
-
[15]
Clarity & Arabic language quality (0–1). Clear Arabic, understandable structure, minimal ambiguity appropriate for a fatwa answer.
-
[16]
Directness & fatwa format (0–1). Directly answers the question; avoids long digressions; phrasing suitable for a fatwa. Critical checks (true/false): contradicts_ground_truth (does the candidate contradict the central ruling?); omits_critical_conditions (does it omit key conditions/exceptions that change the ruling?); introduces_unlawful_elements (does …
-
[21]
Directness & on-topic (0–1). Critical checks (true/false): contradicts_ground_truth; omits_critical_conditions; introduces_unlawful_elements; hallucinated_citations; non_answer_or_evasive; off_topic_or_unsafe. Output format (strict): output only valid JSON (no prose, no code fences), following this schema: { "scores": {"coverage_core_ruling": <float…
-
[22]
Core conclusion alignment (0–4). Does the candidate capture the main thesis and key takeaways of the ground truth (what/why/so-what)?
-
[23]
Quantitative fidelity & use of figures (0–2). Correctly cites/uses the reported numbers (e.g., percentages, amounts, maturities, oversubscription) without inventing or altering figures. Any simple computations/comparisons must be consistent.
-
[24]
Financial reasoning soundness (0–2). Causality and mechanisms are plausible and consistent with standard finance/econ logic (e.g., pricing vs. credit risk, duration/tenor structure, demand/oversubscription signals, capital adequacy).
-
[25]
Clarity & Arabic language quality (0–1). Clear Arabic, coherent structure, minimal ambiguity.
-
[26]
Directness & on-topic grounding (0–1). Answers what was asked; stays anchored in the provided scenario/data (no generic filler). Critical checks (true/false): contradicts_ground_truth (contradicts the central conclusion of the reference); fabricates_or_alters_numbers (introduces numbers not present or materially distorts reported figures); hallucinates_…
discussion (0)