QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning

Mohammad AL-Smadi

arxiv: 2604.16396 · v1 · submitted 2026-03-29 · 💻 cs.CL


Pith reviewed 2026-05-14 22:11 UTC · model grok-4.3


keywords: QLoRA fine-tuning, Arabic NLP, Islamic inheritance, legal reasoning, small language models, structured output, domain adaptation, Qwen3


The pith

Multi-stage QLoRA fine-tuning lets a 4B model reach a 90% MIR-E score on the QIAS 2026 Arabic Islamic inheritance reasoning test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a small language model can perform multi-step legal reasoning in Islamic inheritance law by first adapting to domain-specific fatwa texts and then training on structured cases for precise JSON outputs. This two-stage process uses quantized low-rank adaptation to keep resource demands low while, by the author's account, staying competitive with much larger commercial systems. A sympathetic reader would care because it suggests that specialized fine-tuning can make complex rule-based and fractional calculations feasible on accessible hardware rather than requiring massive infrastructure.

Core claim

Domain adaptation on 3,166 Islamic fatwa records followed by task-specific training on 12,000 structured inheritance cases, using 4-bit NF4 quantization and rank-128 LoRA adapters on Qwen3-4B, yields a 90% MIR-E score on the test set and, according to the author, structured legal reasoning that is competitive with systems such as Gemini-2.5-flash.

What carries the argument

Multi-stage QLoRA fine-tuning: initial domain adaptation on fatwas to acquire terminology and patterns, followed by task-specific training on structured cases to optimize JSON-formatted output.
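To make the resource argument concrete, the following is a back-of-the-envelope sketch of what rank-128 adapters and 4-bit weights imply. Only the rank (128), the 4-bit width, and the 4B parameter scale come from the text above; the 2560 × 2560 layer shape and the helper names `lora_params` and `weight_bytes` are illustrative assumptions, not values or code from the paper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A with A of shape (rank, d_in) and B of
    shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Raw storage for the weights alone, ignoring quantization constants."""
    return n_params * bits_per_param // 8


RANK = 128                    # stated in the paper
HIDDEN = 2560                 # ASSUMED square projection size, illustration only
BASE_PARAMS = 4_000_000_000   # "4B", rounded

frozen = HIDDEN * HIDDEN
added = lora_params(HIDDEN, HIDDEN, RANK)
print("adapter params per layer:", added)
print("fraction of the frozen layer:", added / frozen)
print("4-bit base weights (GB):", weight_bytes(BASE_PARAMS, 4) / 1e9)
print("16-bit base weights (GB):", weight_bytes(BASE_PARAMS, 16) / 1e9)
```

Under these assumptions the adapter adds 655,360 trainable parameters to a layer with 6,553,600 frozen ones, and the quantized base weights occupy roughly 2 GB instead of 8 GB. Rank 128 is a fairly large adapter, so the savings come mostly from the 4-bit base model rather than from a tiny trainable footprint.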

If this is right

Small models can execute multi-step fractional calculations and blocking decisions required by inheritance law.
Domain pre-adaptation followed by structured output training produces competitive results against larger systems.
Quantized low-rank adaptation keeps computational costs low while maintaining high accuracy on jurisprudential tasks.
JSON-formatted legal outputs become reliable after the second training stage.
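The kind of multi-step calculation these points refer to can be sketched on a toy case. The example below is a deliberately simplified illustration, not the paper's data, code, or a complete treatment of inheritance law: it covers only five heir types, assumes at least one son is present, and uses commonly stated rules (wife 1/8 and mother 1/6 when the deceased has children, a son excluding full brothers, and sons and daughters sharing the residue 2:1). The function name `distribute` and the input layout are hypothetical.

```python
from fractions import Fraction


def distribute(heirs: dict[str, int]) -> dict[str, Fraction]:
    """Toy share calculator; returns the collective share of each heir group.

    heirs maps a heir type to a head count. Supported types (an assumption
    of this sketch): "wife", "mother", "son", "daughter", "full_brother".
    """
    assert heirs.get("son", 0) > 0, "this sketch only covers cases with a son"
    shares: dict[str, Fraction] = {}

    # Step 1, blocking: a son excludes full brothers entirely.
    if heirs.get("full_brother", 0):
        shares["full_brother"] = Fraction(0)

    # Step 2, fixed shares when the deceased leaves children.
    if heirs.get("wife", 0):
        shares["wife"] = Fraction(1, 8)
    if heirs.get("mother", 0):
        shares["mother"] = Fraction(1, 6)

    # Step 3, residue to the children, each son counting as two daughters.
    residue = Fraction(1) - sum(shares.values(), Fraction(0))
    units = 2 * heirs["son"] + heirs.get("daughter", 0)
    shares["son"] = residue * 2 * heirs["son"] / units
    if heirs.get("daughter", 0):
        shares["daughter"] = residue * heirs["daughter"] / units
    return shares


case = {"wife": 1, "mother": 1, "son": 1, "daughter": 1, "full_brother": 1}
result = distribute(case)
for heir, share in result.items():
    print(heir, share)
```

Exact rational arithmetic matters here: the residue is 17/24, so the son receives 17/36 and the daughter 17/72, and the shares sum to exactly 1. A model that reproduces the JSON layout but slips on any one of the three steps yields a well-formed yet wrong answer, which is why per-step diagnostics are useful.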

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage pattern could transfer to other rule-heavy Arabic domains such as contract law or religious rulings.
On-device deployment becomes realistic for users needing quick inheritance calculations without cloud access.
Generalization tests on unseen schools of Islamic jurisprudence would clarify the limits of the current adaptation.

Load-bearing premise

The 3,166 fatwa records and 12,000 structured cases represent real-world queries well enough that the MIR-E metric measures actual legal reasoning rather than surface pattern matching.

What would settle it

Run the model on a fresh collection of inheritance problems that combine rules and heir configurations absent from both the fatwa and structured training sets, then check whether MIR-E accuracy drops well below 90%.
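One way to assemble such a collection is to filter candidate problems by heir configuration. The sketch below assumes a hypothetical case layout in which each case carries a `heirs` list of objects with a `type` field; the paper does not describe its schema at this level, so both the layout and the function names are illustrative.

```python
def heir_signature(case: dict) -> frozenset[str]:
    """Order-independent key for the set of heir types in a case."""
    return frozenset(h["type"] for h in case["heirs"])


def unseen_configurations(train: list[dict], candidates: list[dict]) -> list[dict]:
    """Keep only candidates whose heir configuration never occurs in train."""
    seen = {heir_signature(c) for c in train}
    return [c for c in candidates if heir_signature(c) not in seen]


# Tiny made-up cases, for illustration only.
train_cases = [
    {"id": "t1", "heirs": [{"type": "wife"}, {"type": "son"}]},
    {"id": "t2", "heirs": [{"type": "mother"}, {"type": "daughter"}]},
]
candidate_cases = [
    {"id": "c1", "heirs": [{"type": "son"}, {"type": "wife"}]},       # seen, reordered
    {"id": "c2", "heirs": [{"type": "wife"}, {"type": "mother"}, {"type": "son"}]},
]

held_out = unseen_configurations(train_cases, candidate_cases)
print([c["id"] for c in held_out])
```

Keying on heir types alone is the coarsest useful choice; a stricter split could also include head counts or the blocking rules a case triggers in the signature.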

Original abstract

Islamic inheritance law (ʿilm al-mawārīth) presents a challenging domain for evaluating large language models' structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. Our approach employs a multi-stage Quantized Low-Rank Adaptation (QLoRA) fine-tuning strategy on Qwen3-4B: (1) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by (2) task-specific training on 12,000 structured inheritance cases to optimize JSON-formatted output generation. Using 4-bit NF4 quantization with rank-128 LoRA adapters, our model achieves 90% MIR-E (Mawarith Inheritance Reasoning Evaluation) score on the test set, demonstrating competitive performance while requiring minimal computational resources. Our results show that domain-specific pre-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively comparing to commercial systems such as Gemini-2.5-flash.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard two-stage QLoRA on Qwen3-4B reaches 90% MIR-E on Arabic inheritance reasoning, but the score comes with no breakdown or baselines to separate real reasoning from output mimicry.


This paper describes a two-stage QLoRA fine-tune of Qwen3-4B for the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. The author first adapts the model on 3,166 fatwa records to pick up terminology and patterns, then trains on 12,000 structured cases to generate JSON outputs. The reported result is 90% on the MIR-E test set, presented as competitive with Gemini-2.5-flash at low compute cost using 4-bit NF4 and rank-128 adapters.

The setup is clear and the numbers are concrete, which is the main practical value for anyone replicating domain adaptation on narrow legal tasks in Arabic. The two-stage sequence is a sensible way to handle the shift from raw text to structured output, and listing the exact hyperparameters helps with follow-up work.

That said, the evaluation is thin. No baselines appear, no ablation tests the contribution of the fatwa stage, and there is no error analysis or component breakdown of MIR-E. The metric itself is described only at a high level, so it is unclear how much of the 90% reflects correct fractional calculations and blocking rules versus simply reproducing the JSON schema and frequent patterns from the 12,000 training cases. The stress-test concern about surface-level formatting holds up on the available details; without out-of-distribution tests or per-rule accuracy, the claim of complex legal reasoning rests on a single aggregate number.

This is a system paper for shared-task participants and researchers doing efficient fine-tuning on specialized domains. It gives a usable recipe and a concrete score, but the evidence for depth is limited. It deserves peer review for the workshop proceedings because the method is reproducible and the result is stated plainly, though reviewers will need to push for more diagnostics before the work can be cited with confidence.

Referee Report

3 major / 1 minor

Summary. The manuscript describes QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. It applies a two-stage QLoRA fine-tuning procedure to Qwen3-4B: domain adaptation on 3,166 Islamic fatwa records followed by task-specific training on 12,000 structured inheritance cases using 4-bit NF4 quantization and rank-128 adapters. The model is reported to reach 90% on the MIR-E metric on the test set and is positioned as competitive with Gemini-2.5-flash while using minimal computational resources.

Significance. If the 90% MIR-E score reflects verified multi-step legal reasoning (including blocking rules and fractional share calculations) rather than output-format mimicry, the work would demonstrate that small quantized models can perform complex domain-specific reasoning with limited resources. This could support broader adoption of efficient fine-tuning pipelines for specialized legal and regulatory tasks.

major comments (3)

[Abstract] Abstract: The MIR-E metric is referenced only at a high level with no definition, component breakdown (e.g., accuracy on heir identification, blocking conditions, or fractional arithmetic), or error analysis. Without these details it is impossible to determine whether the 90% score measures genuine reasoning or surface-level JSON pattern matching after training on 12,000 structured cases.
[Abstract] Abstract: The claim of competitive performance versus Gemini-2.5-flash is unsupported by any numerical baseline, direct comparison table, or statistical test. No results from the shared-task leaderboard or alternative models are reported.
[Results] No ablation or generalization experiments are described. The two-stage procedure is presented without controls that isolate the contribution of the fatwa pre-adaptation stage or that test performance on unseen combinations of heirs and blocking rules outside the 12,000 training cases.

minor comments (1)

[Abstract] Abstract: The phrase 'comparing to commercial systems' should read 'compared to commercial systems'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our submission to the QIAS 2026 shared task. We address each major comment point by point below. Where revisions are feasible, we will update the manuscript to improve clarity and rigor while remaining faithful to the experiments performed.


Referee: [Abstract] Abstract: The MIR-E metric is referenced only at a high level with no definition, component breakdown (e.g., accuracy on heir identification, blocking conditions, or fractional arithmetic), or error analysis. Without these details it is impossible to determine whether the 90% score measures genuine reasoning or surface-level JSON pattern matching after training on 12,000 structured cases.

Authors: We agree the abstract should define MIR-E more explicitly. In the revision we will add: 'MIR-E evaluates accuracy across heir identification, application of blocking rules, and precise fractional share calculations in structured JSON output.' A component-wise breakdown and error analysis already appear in the Results section; we will reference this explicitly from the abstract to demonstrate that the score reflects multi-step legal reasoning rather than format mimicry.

revision: yes

Referee: [Abstract] Abstract: The claim of competitive performance versus Gemini-2.5-flash is unsupported by any numerical baseline, direct comparison table, or statistical test. No results from the shared-task leaderboard or alternative models are reported.

Authors: The original phrasing intended to highlight resource efficiency rather than direct numerical superiority. Because Gemini-2.5-flash was not part of the official shared-task evaluation, we lack official leaderboard numbers for it. We will revise the abstract to remove the specific Gemini comparison and instead emphasize that the 90% MIR-E score was obtained with a single-GPU QLoRA setup, positioning the result in terms of accessibility for domain-specific legal tasks.

revision: yes

Referee: [Results] No ablation or generalization experiments are described. The two-stage procedure is presented without controls that isolate the contribution of the fatwa pre-adaptation stage or that test performance on unseen combinations of heirs and blocking rules outside the 12,000 training cases.

Authors: We acknowledge that explicit ablation studies isolating the fatwa adaptation stage and tests on novel heir combinations would strengthen the claims. Due to shared-task time and compute limits we did not run full controls; the two-stage design was selected after small-scale pilots showed improved terminology handling. In revision we will add a dedicated Limitations paragraph explaining this design choice and the absence of ablations, while noting that test-set performance provides indirect evidence of generalization within the task distribution.

revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical fine-tuning results


The paper reports an empirical multi-stage QLoRA fine-tuning procedure on 3,166 fatwa records followed by 12,000 structured cases, with performance measured by the MIR-E score on a held-out test set. No equations, mathematical derivations, uniqueness theorems, or ansatzes are present. The central claim is an experimental outcome (90% MIR-E) rather than a prediction derived from fitted parameters or self-referential definitions. No self-citations are load-bearing, and the MIR-E metric functions as an external evaluation score rather than a quantity defined in terms of the model's own outputs. The derivation chain is self-contained as standard supervised fine-tuning and evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard supervised fine-tuning assumptions and the representativeness of the provided training corpora; no new entities or axioms are introduced.

free parameters (2)

LoRA rank = 128
Set to 128 for the adapters in both stages.
Quantization precision = 4-bit NF4
Fixed at 4-bit NF4.



