pith. machine review for the scientific record.

arxiv: 2604.19098 · v2 · submitted 2026-04-21 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:08 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Arabic NLP · financial benchmark · Shari'ah compliance · LLM evaluation · Islamic finance · event reasoning

The pith

A new benchmark shows Arabic fluency in LLMs does not ensure financial reasoning ability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAHM, the first benchmark for Arabic financial and Shari'ah-compliant reasoning. It covers seven tasks, including standards QA, fatwa questions, accounting exams, sentiment analysis, summarization, and event-cause reasoning, comprising over 14,000 expert-verified examples from authentic sources. Testing twenty language models reveals that strong performance on recognition tasks does not translate to generation tasks or to complex reasoning such as identifying the causes of financial events. This matters because Arabic-speaking regions host massive Islamic finance markets that require precise compliance knowledge. The benchmark aims to support the development of more reliable Arabic financial AI tools.

Core claim

We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89-9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.

What carries the argument

The SAHM benchmark with its seven specific tasks drawn from authentic Arabic financial and regulatory sources to evaluate LLMs on Shari'ah-compliant reasoning.

Load-bearing premise

The selected seven tasks and expert-verified instances from regulatory sources comprehensively and without bias represent the full scope of Arabic financial and Shari'ah-compliant reasoning needs across regions and dialects.

What would settle it

A new LLM achieving high performance on all tasks, including event-cause reasoning, after only general Arabic training without financial-specific data would challenge the identified performance gaps.

Figures

Figures reproduced from arXiv: 2604.19098 by Ahmed Heakl, Dani Bouch, Jimin Huang, Marwa Elsaid Khalil, Momina Ahsan, Muhra AlMahri, Preslav Nakov, Rania Elbadry, Salem Lahlou, Sarfraz Ahmad, Sophia Ananiadou, Veselin Stoyanov, Xueqing Peng, Yuxia Wang, Zhuohan Xie.

Figure 1: Examples of the diverse tasks included in SAHM, covering juristic Q&A, business and accounting MCQs, financial sentiment analysis, report summarization, and event-cause reasoning.
Figure 2: Pipeline for constructing the Islamic Finance Shari'ah Standards QA dataset. A hybrid LLM-human pipeline converts AAOIFI standards into QA pairs through OCR and generation stages, each followed by expert verification to ensure linguistic accuracy and legal fidelity.
Figure 3: Effect of reasoning token budget on ruling accuracy. Green indicates improvement with increased budget, red indicates decline, and blue indicates no change.
Figure 4: Qualitative error analysis showing representative failure modes. Left: Islamic knowledge error where Gemma-3-27B incorrectly rules a permissible transaction as forbidden, citing fabricated evidence with wrong wording of an authentic Hadith. Right: Concept confusion error where Qwen2.5-72B conflates total interest incurred with capitalizable interest in a construction loan scenario.
Figure 5: Models Talk More, Not Better. Despite generating 4-6× more fatwa text than humans, models do not achieve proportionally higher accuracy, indicating that verbosity serves as a proxy for uncertainty rather than expertise.
Figure 6: Root cause distribution of model errors across …
Figure 7: Effect of number of evidences from Hadith …
Figure 8: OCR quality evaluation interface for the Shari'ah Standards QA dataset. The tool displays each scanned …
Figure 9: Prompt for Arabic OCR text extraction with …
Figure 10: Prompt for generating Arabic QA pairs from …
Figure 11: Custom annotation interface used to validate automatically generated multiple-choice questions (MCQs) …
Figure 12: Prompt for Arabic fatwā Q&A normalization with minimal editing and preservation of juristic intent. Accompanying category counts: Zakat 4,888; Riba 2,454; Murabaha 1,389; Gharar 860; Waqf 730; Ijara 571; Maysir 372; Musharaka 242; Mudharaba 228; Takaful 187; Sukuk 32; total records 11,953.
Figure 13: Custom annotation platform used to label Arabic financial reports for sentiment analysis. Annotators …
Figure 14: Guidelines for document-level sentiment annotation …
Figure 15: Custom web-based annotation interface for extractive summarization. Annotators view Arabic financial …
Figure 16: Guidelines for extractive summarization annotation …
Figure 17: Guidelines and quality control workflow for event-cause reasoning annotation in Arabic financial reports.
Figure 18: Prompt for extracting MCQs from Arabic accounting exams with exercise-based layouts.
Figure 20: Evaluation rubric used for LLM-based judgment of fatwa QA responses.
Figure 21: Evaluation rubric used for LLM-based judgment …
Figure 22: Evaluation rubric used for LLM-based judgment of financial analysis and event-cause reasoning tasks.
Original abstract

English financial NLP has advanced rapidly through benchmarks targeting earnings analysis, market sentiment, tabular reasoning, and financial question answering, yet Arabic financial NLP remains virtually nonexistent, despite 422 million speakers, $4.9 trillion in Gulf sovereign wealth, and a $4-5 trillion Islamic finance industry requiring specialized Shari'ah compliance over instruments like sukuk, murabaha, and takaful. We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89-9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SAHM, the first Arabic financial and Shari'ah-compliant reasoning benchmark comprising 14,380 expert-verified instances drawn from authentic regulatory, juristic, and corporate sources. It covers seven tasks (AAOIFI standards QA, fatwa-based QA/MCQ, accounting/business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning) and evaluates 20 LLMs, concluding that Arabic fluency does not imply financial reasoning: models reach 91% on recognition tasks but drop sharply on generation, with the widest gaps on event-cause reasoning (1.89-9.84/10). The benchmark and dataset are released publicly.

Significance. If the dataset labels prove reliable, SAHM would fill a clear gap in Arabic financial NLP for a large speaker population and the $4-5 trillion Islamic finance industry. The empirical demonstration of task-specific performance drops and the use of authentic external sources (rather than synthetic data) are strengths; public release of the benchmark supports reproducibility and further work.

major comments (2)
  1. [Abstract and dataset construction section] The central claim rests on 14,380 'expert-verified' instances, yet no inter-expert agreement scores, expert qualification details, or disagreement-resolution protocol are reported. This is load-bearing for interpretive tasks (fatwa-based QA and AAOIFI standards QA), where rulings can vary by madhhab; without these metrics, systematic label noise could affect the validity of the reported performance gaps (e.g., 1.89-9.84/10 on event-cause reasoning).
  2. [Evaluation and results section] The claim that 'Arabic fluency does not imply financial reasoning' is supported by raw performance numbers but lacks statistical tests (e.g., paired significance tests or confidence intervals) for the observed drops from recognition to generation tasks and for the specific event-cause range. This weakens the cross-task and cross-model conclusions.
minor comments (2)
  1. [Abstract] Instance selection criteria and per-task instance counts are not summarized, making it harder to assess coverage of the claimed seven tasks.
  2. [Throughout] Arabic terms (e.g., sukuk, murabaha, takaful) would benefit from consistent transliteration and brief English glosses on first use for broader readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps strengthen the presentation of our benchmark. We address each major comment below and commit to revisions that improve transparency and rigor without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract and dataset construction section] The central claim rests on 14,380 'expert-verified' instances, yet no inter-expert agreement scores, expert qualification details, or disagreement-resolution protocol are reported. This is load-bearing for interpretive tasks (fatwa-based QA and AAOIFI standards QA), where rulings can vary by madhhab; without these metrics, systematic label noise could affect the validity of the reported performance gaps (e.g., 1.89-9.84/10 on event-cause reasoning).

    Authors: We agree that explicit details on the verification process are essential for establishing benchmark reliability, especially for interpretive tasks. In the revised manuscript, we will expand the dataset construction section to describe the experts' qualifications (e.g., certified Shari'ah scholars with AAOIFI or equivalent credentials and professional accountants), the multi-stage review protocol used for disagreement resolution, and any available agreement metrics from the verification process. We will also add a limitations paragraph addressing potential madhhab-based variations and label noise, noting that sources were selected from authoritative, consensus-oriented regulatory bodies to mitigate this. These additions will better support the reported performance gaps. revision: yes
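The agreement metrics promised in this response could be reported with a standard Cohen's κ computation. A minimal stdlib sketch follows; the two annotators' ruling labels below are hypothetical illustrations, not drawn from the paper's data:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap under independent labeling.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rulings from two experts (P = permissible, F = forbidden).
expert_1 = ["P", "P", "F", "P", "F", "F", "P", "F", "P", "P"]
expert_2 = ["P", "F", "F", "P", "F", "F", "P", "F", "P", "P"]
print(round(cohen_kappa(expert_1, expert_2), 2))  # → 0.8
```

Reporting κ alongside the raw agreement rate would directly address the referee's concern about label noise on interpretive tasks.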

  2. Referee: [Evaluation and results section] The claim that 'Arabic fluency does not imply financial reasoning' is supported by raw performance numbers but lacks statistical tests (e.g., paired significance tests or confidence intervals) for the observed drops from recognition to generation tasks and for the specific event-cause range. This weakens the cross-task and cross-model conclusions.

    Authors: We concur that statistical support would strengthen the cross-task and cross-model claims. In the revised evaluation and results section, we will add 95% confidence intervals for all task scores and include paired statistical tests (e.g., Wilcoxon signed-rank tests for recognition vs. generation drops and appropriate post-hoc tests for event-cause reasoning ranges across models). This will provide quantitative evidence for the performance disparities while preserving the original empirical observations. revision: yes
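One way to realize the confidence intervals promised in this response is a paired bootstrap over per-model score differences between recognition and generation tasks. A stdlib-only sketch; the per-model scores below are hypothetical, not the paper's reported numbers:

```python
import random

def paired_bootstrap_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap (1 - alpha) CI for the mean paired difference x - y."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Resample paired differences with replacement, record each mean.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-model scores: recognition (MCQ) vs. generation tasks.
recognition = [91, 88, 85, 90, 87, 92, 84, 89]
generation = [60, 55, 62, 58, 64, 57, 61, 59]
lo, hi = paired_bootstrap_ci(recognition, generation)
print(lo > 0)  # prints True: an interval excluding zero supports a real gap
```

A Wilcoxon signed-rank test on the same paired differences (e.g., via `scipy.stats.wilcoxon`) would complement the interval with a significance level, as the authors propose.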

Circularity Check

0 steps flagged

No circularity: benchmark compiled from external sources and evaluated directly on LLMs

Full rationale

The paper introduces SAHM by collecting 14,380 instances from authentic regulatory, juristic, and corporate sources, followed by expert verification. It then runs standard evaluations of 20 off-the-shelf LLMs on seven tasks and reports raw performance numbers (e.g., recognition vs. generation gaps). No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the derivation chain. The central results are direct measurements against external data and models, not reductions to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the construction of a new benchmark from authentic external sources with expert verification; no free parameters or invented entities are introduced, and the main unstated premise is that the chosen tasks and sources adequately cover the domain.

axioms (1)
  • domain assumption: Expert verification by domain specialists ensures the quality, accuracy, and representativeness of the 14,380 benchmark instances.
    Invoked to support the claim that the dataset is reliable for evaluating financial and Shari'ah reasoning.

pith-pipeline@v0.9.0 · 5549 in / 1502 out tokens · 56181 ms · 2026-05-10T03:08:46.977993+00:00 · methodology

