SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
Pith reviewed 2026-05-10 03:08 UTC · model grok-4.3
The pith
A new benchmark shows Arabic fluency in LLMs does not ensure financial reasoning ability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89-9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.
What carries the argument
The SAHM benchmark with its seven specific tasks drawn from authentic Arabic financial and regulatory sources to evaluate LLMs on Shari'ah-compliant reasoning.
Load-bearing premise
The selected seven tasks and expert-verified instances from regulatory sources comprehensively and without bias represent the full scope of Arabic financial and Shari'ah-compliant reasoning needs across regions and dialects.
What would settle it
A new LLM achieving high performance on all tasks, including event-cause reasoning, after only general Arabic training without financial-specific data would challenge the identified performance gaps.
Original abstract
English financial NLP has advanced rapidly through benchmarks targeting earnings analysis, market sentiment, tabular reasoning, and financial question answering, yet Arabic financial NLP remains virtually nonexistent, despite 422 million speakers, $4.9 trillion in Gulf sovereign wealth, and a $4-5 trillion Islamic finance industry requiring specialized Shari'ah compliance over instruments like sukuk, murabaha, and takaful. We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89-9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SAHM, the first Arabic financial and Shari'ah-compliant reasoning benchmark comprising 14,380 expert-verified instances drawn from authentic regulatory, juristic, and corporate sources. It covers seven tasks (AAOIFI standards QA, fatwa-based QA/MCQ, accounting/business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning) and evaluates 20 LLMs, concluding that Arabic fluency does not imply financial reasoning: models reach 91% on recognition tasks but drop sharply on generation, with the widest gaps on event-cause reasoning (1.89-9.84/10). The benchmark and dataset are released publicly.
Significance. If the dataset labels prove reliable, SAHM would fill a clear gap in Arabic financial NLP for a large speaker population and the $4-5 trillion Islamic finance industry. The empirical demonstration of task-specific performance drops and the use of authentic external sources (rather than synthetic data) are strengths; public release of the benchmark supports reproducibility and further work.
major comments (2)
- [Abstract and dataset construction section] The central claim rests on 14,380 'expert-verified' instances, yet no inter-expert agreement scores, expert qualification details, or disagreement-resolution protocol are reported. This is load-bearing for interpretive tasks (fatwa-based QA and AAOIFI standards), where rulings can vary by madhhab; without these metrics, systematic label noise could affect the validity of the reported performance gaps (e.g., 1.89-9.84/10 on event-cause reasoning).
- [Evaluation and results section] The claim that 'Arabic fluency does not imply financial reasoning' is supported by raw performance numbers but lacks statistical tests (e.g., paired significance tests or confidence intervals) for the observed drops from recognition to generation tasks and for the specific event-cause range. This weakens the cross-task and cross-model conclusions.
minor comments (2)
- [Abstract] Instance selection criteria and per-task instance counts are not summarized, making it harder to assess coverage of the claimed seven tasks.
- [Throughout] Arabic terms (e.g., sukuk, murabaha, takaful) would benefit from consistent transliteration and brief English glosses on first use for broader readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which helps strengthen the presentation of our benchmark. We address each major comment below and commit to revisions that improve transparency and rigor without altering the core findings.
read point-by-point responses
Referee: [Abstract and dataset construction section] The central claim rests on 14,380 'expert-verified' instances, yet no inter-expert agreement scores, expert qualification details, or disagreement-resolution protocol are reported. This is load-bearing for interpretive tasks (fatwa-based QA and AAOIFI standards), where rulings can vary by madhhab; without these metrics, systematic label noise could affect the validity of the reported performance gaps (e.g., 1.89-9.84/10 on event-cause reasoning).
Authors: We agree that explicit details on the verification process are essential for establishing benchmark reliability, especially for interpretive tasks. In the revised manuscript, we will expand the dataset construction section to describe the experts' qualifications (e.g., certified Shari'ah scholars with AAOIFI or equivalent credentials and professional accountants), the multi-stage review protocol used for disagreement resolution, and any available agreement metrics from the verification process. We will also add a limitations paragraph addressing potential madhhab-based variations and label noise, noting that sources were selected from authoritative, consensus-oriented regulatory bodies to mitigate this. These additions will better support the reported performance gaps.
Revision: yes
Referee: [Evaluation and results section] The claim that 'Arabic fluency does not imply financial reasoning' is supported by raw performance numbers but lacks statistical tests (e.g., paired significance tests or confidence intervals) for the observed drops from recognition to generation tasks and for the specific event-cause range. This weakens the cross-task and cross-model conclusions.
Authors: We concur that statistical support would strengthen the cross-task and cross-model claims. In the revised evaluation and results section, we will add 95% confidence intervals for all task scores and include paired statistical tests (e.g., Wilcoxon signed-rank tests for recognition vs. generation drops and appropriate post-hoc tests for event-cause reasoning ranges across models). This will provide quantitative evidence for the performance disparities while preserving the original empirical observations.
Revision: yes
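The committed analysis could be sketched as follows. This is a minimal illustration with hypothetical per-model scores (the real values would come from the SAHM leaderboard): a paired Wilcoxon signed-rank test on the recognition-to-generation drop, plus a bootstrap confidence interval for the mean drop.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-model accuracies for 20 models; invented for illustration,
# not the paper's actual numbers.
recognition = rng.uniform(0.70, 0.95, size=20)
generation = recognition - rng.uniform(0.10, 0.40, size=20)

# Paired Wilcoxon signed-rank test on the per-model drop.
stat, p = wilcoxon(recognition, generation)

# 95% bootstrap confidence interval for the mean recognition-generation drop.
drops = recognition - generation
boot_means = rng.choice(drops, size=(10_000, drops.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"p={p:.4g}, mean drop 95% CI=({lo:.3f}, {hi:.3f})")
```

With every model dropping on generation, the signed-rank test is significant by construction here; on real scores the same code would quantify how robust the claimed gap is.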
Circularity Check
No circularity: benchmark compiled from external sources and evaluated directly on LLMs
full rationale
The paper introduces SAHM by collecting 14,380 instances from authentic regulatory, juristic, and corporate sources, followed by expert verification. It then runs standard evaluations of 20 off-the-shelf LLMs on seven tasks and reports raw performance numbers (e.g., recognition vs. generation gaps). No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the derivation chain. The central results are direct measurements against external data and models, not reductions to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert verification by domain specialists ensures the quality, accuracy, and representativeness of the 14,380 benchmark instances.
Reference graph
Works this paper leans on
-
[1]
Gemma 3 technical report. arXiv preprint, abs/2503.19786.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
-
[2]
CFinBench: A comprehensive Chinese financial benchmark for large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL '25, pages 876–891, Albuquerque, New Mexico. Association for Computational Linguistics.
-
[3]
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning. Preprint, arXiv:2506.02515.
-
[4]
Referral flag. Before editing, set IS_MAINLY_REFERRAL: "YES" if the answer mainly redirects to another fatwā, link, or reference and does not provide a substantive independent ruling; "NO" otherwise.
-
[5]
Clean the question. Edit minimally while preserving wording and fiqh intent: remove greetings, honorifics, and personal appeals; remove formal closings; remove the scholar's name if it is only a form of address, keeping it only if the question explicitly seeks that scholar's specific fatwā or opinion; ensure the …
-
[6]
Clean the answer. Edit minimally while preserving wording and reasoning: remove formal openings and closings so the answer starts with substantive content; remove all fatwā numbers, hyperlinks, and navigational phrases, editing surrounding text just enough to remain grammatical; convert Arabic-Indic numerals to Western numerals; remove pur…
-
[7]
Pilot annotation. Two native Arabic financial experts independently annotate a pilot subset of 20 reports, each producing an event–cause question and an analytical answer.
-
[8]
Agreement assessment. We evaluate agreement at two complementary levels: event–cause identification, measured using Cohen's κ, assessing consistency in identifying salient events and their causes; and answer consistency, measured using ROUGE overlap between independently written answers, used as a consistency check rather than a correctness metric.
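Cohen's κ, the agreement statistic named in this protocol, is straightforward to compute from two annotators' label sequences. The sketch below uses invented event-type labels for illustration; it is not the paper's actual annotation data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' categorical labels (chance-corrected agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling with each annotator's marginals.
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[c] * pb.get(c, 0) for c in pa) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labeling the salient event type in 8 reports.
ann1 = ["earnings", "sukuk", "earnings", "merger", "sukuk", "earnings", "merger", "sukuk"]
ann2 = ["earnings", "sukuk", "merger", "merger", "sukuk", "earnings", "merger", "earnings"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.628
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the kind of evidence the referee asks the authors to report.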
-
[9]
Calibration. Annotators review disagreements from the pilot phase, discuss ambiguous cases (e.g., implicit causality, multi-factor events, overlapping economic drivers), and refine shared annotation criteria. This calibration aligns interpretation standards and reduces annotation drift.
-
[10]
Full annotation. After calibration, one expert annotates the remaining reports under the agreed guidelines.
-
[11]
Audit and correction. A senior annotator audits a random sample of completed annotations to verify that each instance identifies a plausible event and its cause(s) supported by the report, includes relevant numerical evidence when available, and provides an analytical explanation rather than a descriptive summary. Annotations that fail these checks are…
-
[12]
Coverage of core ruling (0–4). The candidate must clearly state the same central hukm (e.g., permissibility/prohibition, validity/invalidity) and include the key justification present in the ground truth. One-word/minimal answers without essential justification should receive a much lower score (e.g., 0–1).
-
[13]
Conditions, exceptions, constraints (0–2). Does it retain critical restrictions, qualifiers, or carve-outs that materially affect the ruling?
-
[14]
Doctrinal/factual accuracy (0–2). No misstatements that would change the fatwa; no implicit legalization of prohibited elements (e.g., ribā); no misleading generalizations or invented requirements.
-
[15]
Clarity & Arabic language quality (0–1). Clear Arabic, understandable structure, minimal ambiguity appropriate for a fatwa answer.
-
[16]
Directness & fatwa format (0–1). Directly answers the question; avoids long digressions; phrasing suitable for a fatwa. Critical checks (true/false): contradicts_ground_truth (does the candidate contradict the central ruling?); omits_critical_conditions (does it omit key conditions/exceptions that change the ruling?); introduces_unlawful_elements (does …
-
[21]
Directness & on-topic (0–1). Critical checks (true/false): contradicts_ground_truth; omits_critical_conditions; introduces_unlawful_elements; hallucinated_citations; non_answer_or_evasive; off_topic_or_unsafe. Output format (strict): output only valid JSON (no prose, no code fences), following this schema: { "scores": {"coverage_core_ruling": <float…
-
[22]
Core conclusion alignment (0–4). Does the candidate capture the main thesis and key takeaways of the ground truth (what/why/so-what)?
-
[23]
Quantitative fidelity & use of figures (0–2). Correctly cites/uses the reported numbers (e.g., percentages, amounts, maturities, oversubscription) without inventing or altering figures. Any simple computations/comparisons must be consistent.
-
[24]
Financial reasoning soundness (0–2). Causality and mechanisms are plausible and consistent with standard finance/econ logic (e.g., pricing vs. credit risk, duration/tenor structure, demand/oversubscription signals, capital adequacy).
-
[25]
Clarity & Arabic language quality (0–1). Clear Arabic, coherent structure, minimal ambiguity.
-
[26]
Directness & on-topic grounding (0–1). Answers what was asked; stays anchored in the provided scenario/data (no generic filler). Critical checks (true/false): contradicts_ground_truth (contradicts the central conclusion of the reference); fabricates_or_alters_numbers (introduces numbers not present or materially distorts reported figures); hallucinates_…
discussion (0)