QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning
Pith reviewed 2026-05-14 22:11 UTC · model grok-4.3
The pith
Multi-stage QLoRA fine-tuning lets a 4B model reach 90% on Arabic Islamic inheritance reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3-4B is first domain-adapted on 3,166 Islamic fatwa records and then trained on 12,000 structured inheritance cases, using 4-bit NF4 quantization and rank-128 LoRA adapters. The resulting model scores 90% MIR-E on the test set, which the authors present as competitive with systems such as Gemini-2.5-flash on structured legal reasoning.
What carries the argument
Multi-stage QLoRA fine-tuning: initial domain adaptation on fatwas to acquire terminology and patterns, followed by task-specific training on structured cases to optimize JSON-formatted output.
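A rough sense of what a rank-128 adapter costs can be sketched by counting the trainable parameters LoRA adds to a single weight matrix. The 2560 x 2560 layer shape below is an illustrative assumption rather than a figure from the paper, the helper name `lora_param_count` is hypothetical, and the paper does not state which layers carry adapters.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank); the base weight stays frozen.
    """
    return rank * d_in + d_out * rank


# Illustrative layer shape (an assumption, not taken from the paper).
d_in, d_out, rank = 2560, 2560, 128

adapter = lora_param_count(d_in, d_out, rank)
full = d_in * d_out
ratio = adapter / full

print(f"adapter params:   {adapter:,}")
print(f"full matrix:      {full:,}")
print(f"fraction trained: {ratio:.1%}")
```

Under these assumed dimensions the adapter trains about a tenth of the parameters of the matrix it modifies; the real ratio depends on the actual layer shapes and on which modules are adapted.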
If this is right
- Small models can execute multi-step fractional calculations and blocking decisions required by inheritance law.
- Domain pre-adaptation followed by structured output training produces competitive results against larger systems.
- Quantized low-rank adaptation keeps computational costs low while maintaining high accuracy on jurisprudential tasks.
- JSON-formatted legal outputs become reliable after the second training stage.
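The kind of multi-step calculation these points refer to can be illustrated with a deliberately narrow toy. It is a sketch under stated assumptions, not the paper's method or the shared task's schema: it only handles cases with at least one son, it covers only the heir types husband, mother, son, daughter, and full brother, and the JSON field names (`blocked`, `shares_per_heir`) are hypothetical.

```python
import json
from fractions import Fraction


def toy_shares(heirs: dict[str, int]) -> dict:
    """Toy distribution for: husband, mother, son, daughter, full_brother.

    Simplified assumptions: at least one son is present, so the husband
    takes 1/4, the mother 1/6, full brothers are blocked, and the children
    share the residue with each son counting twice a daughter.
    """
    sons = heirs.get("son", 0)
    daughters = heirs.get("daughter", 0)
    if sons < 1:
        raise ValueError("this toy only handles cases with at least one son")

    shares: dict[str, Fraction] = {}
    blocked: list[str] = []

    if heirs.get("husband", 0):
        shares["husband"] = Fraction(1, 4)
    if heirs.get("mother", 0):
        shares["mother"] = Fraction(1, 6)
    if heirs.get("full_brother", 0):
        blocked.append("full_brother")  # excluded by the son

    residue = 1 - sum(shares.values(), Fraction(0))
    units = 2 * sons + daughters  # each son counts as two units
    shares["son"] = residue * 2 / units        # share per son
    if daughters:
        shares["daughter"] = residue / units   # share per daughter

    return {
        "blocked": blocked,
        "shares_per_heir": {k: str(v) for k, v in shares.items()},
    }


case = {"husband": 1, "mother": 1, "son": 1, "daughter": 1, "full_brother": 1}
result = toy_shares(case)
print(json.dumps(result, indent=2))
```

Using exact `Fraction` arithmetic rather than floats keeps the shares checkable: here the four shares (1/4, 1/6, 7/18, 7/36) sum to exactly 1.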
Where Pith is reading between the lines
- The same two-stage pattern could transfer to other rule-heavy Arabic domains such as contract law or religious rulings.
- On-device deployment becomes realistic for users needing quick inheritance calculations without cloud access.
- Generalization tests on unseen schools of Islamic jurisprudence would clarify the limits of the current adaptation.
Load-bearing premise
The 3,166 fatwa records and 12,000 structured cases represent real-world queries well enough that the MIR-E metric measures actual legal reasoning rather than surface pattern matching.
What would settle it
Run the model on a fresh collection of inheritance problems that combine rules and heir configurations absent from both the fatwa and structured training sets, then check whether MIR-E accuracy drops well below 90%.
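One way to set up such a check is to split test accuracy by whether a case's heir configuration ever appeared in training. The sketch below assumes each case is a dict with a list of heir types under `heirs` and a boolean `correct` flag on test cases; these field names and the sample records are made up for illustration and are not data from the paper.

```python
def heir_signature(case: dict) -> frozenset:
    """Order-independent signature of the heir types in a case."""
    return frozenset(case["heirs"])


def accuracy_by_novelty(train_cases, test_cases):
    """Split test accuracy into seen vs. unseen heir configurations."""
    seen = {heir_signature(c) for c in train_cases}
    buckets = {"seen": [], "unseen": []}
    for case in test_cases:
        key = "seen" if heir_signature(case) in seen else "unseen"
        buckets[key].append(case["correct"])
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in buckets.items()
    }


# Made-up records for illustration only.
train = [
    {"heirs": ["husband", "son"]},
    {"heirs": ["wife", "mother", "daughter"]},
]
test = [
    {"heirs": ["son", "husband"], "correct": True},
    {"heirs": ["wife", "mother", "daughter"], "correct": True},
    {"heirs": ["husband", "mother", "full_sister"], "correct": False},
    {"heirs": ["father", "daughter"], "correct": True},
]

report = accuracy_by_novelty(train, test)
print(report)
```

This signature ignores heir counts (one daughter versus three), so a stricter variant would include them. A large gap between the two buckets would point toward pattern matching rather than rule application.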
Original abstract
Islamic inheritance law (ilm al-mawārīth) presents a challenging domain for evaluating large language models' structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. Our approach employs a multi-stage Quantized Low-Rank Adaptation (QLoRA) fine-tuning strategy on Qwen3-4B: (1) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by (2) task-specific training on 12,000 structured inheritance cases to optimize JSON-formatted output generation. Using 4-bit NF4 quantization with rank-128 LoRA adapters, our model achieves 90% MIR-E (Mawarith Inheritance Reasoning Evaluation) score on the test set, demonstrating competitive performance while requiring minimal computational resources. Our results show that domain-specific pre-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively comparing to commercial systems such as Gemini-2.5-flash.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. It applies a two-stage QLoRA fine-tuning procedure to Qwen3-4B: domain adaptation on 3,166 Islamic fatwa records followed by task-specific training on 12,000 structured inheritance cases using 4-bit NF4 quantization and rank-128 adapters. The model is reported to reach 90% on the MIR-E metric on the test set and is positioned as competitive with Gemini-2.5-flash while using minimal computational resources.
Significance. If the 90% MIR-E score reflects verified multi-step legal reasoning (including blocking rules and fractional share calculations) rather than output-format mimicry, the work would demonstrate that small quantized models can perform complex domain-specific reasoning with limited resources. This could support broader adoption of efficient fine-tuning pipelines for specialized legal and regulatory tasks.
major comments (3)
- [Abstract] Abstract: The MIR-E metric is referenced only at a high level with no definition, component breakdown (e.g., accuracy on heir identification, blocking conditions, or fractional arithmetic), or error analysis. Without these details it is impossible to determine whether the 90% score measures genuine reasoning or surface-level JSON pattern matching after training on 12,000 structured cases.
- [Abstract] Abstract: The claim of competitive performance versus Gemini-2.5-flash is unsupported by any numerical baseline, direct comparison table, or statistical test. No results from the shared-task leaderboard or alternative models are reported.
- [Results] No ablation or generalization experiments are described. The two-stage procedure is presented without controls that isolate the contribution of the fatwa pre-adaptation stage or that test performance on unseen combinations of heirs and blocking rules outside the 12,000 training cases.
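The component breakdown requested in the first comment could look something like the sketch below. This is a hypothetical scoring scheme and not the definition of MIR-E, which the abstract does not give; the function name, the `shares` and `blocked` fields, and the sample prediction are all assumptions made for illustration.

```python
from fractions import Fraction


def component_scores(pred: dict, gold: dict) -> dict:
    """Score one prediction on three separate components (0.0 or 1.0 each)."""
    heirs_ok = set(pred["shares"]) == set(gold["shares"])
    blocked_ok = set(pred["blocked"]) == set(gold["blocked"])
    # Compare shares as exact fractions so "2/8" and "1/4" match.
    shares_ok = heirs_ok and all(
        Fraction(pred["shares"][h]) == Fraction(gold["shares"][h])
        for h in gold["shares"]
    )
    return {
        "heir_identification": float(heirs_ok),
        "blocking": float(blocked_ok),
        "fractional_shares": float(shares_ok),
    }


# Made-up example: right heirs and blocking, wrong arithmetic for the son.
gold = {"shares": {"husband": "1/4", "son": "3/4"}, "blocked": ["full_brother"]}
pred = {"shares": {"husband": "2/8", "son": "1/2"}, "blocked": ["full_brother"]}

scores = component_scores(pred, gold)
print(scores)
```

Reporting the three components separately would show whether errors concentrate in the arithmetic, in the blocking decisions, or in heir identification, which a single aggregate percentage hides.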
minor comments (1)
- [Abstract] Abstract: The phrase 'comparing to commercial systems' should read 'compared to commercial systems'.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our submission to the QIAS 2026 shared task. We address each major comment point by point below. Where revisions are feasible, we will update the manuscript to improve clarity and rigor while remaining faithful to the experiments performed.
- Referee: [Abstract] Abstract: The MIR-E metric is referenced only at a high level with no definition, component breakdown (e.g., accuracy on heir identification, blocking conditions, or fractional arithmetic), or error analysis. Without these details it is impossible to determine whether the 90% score measures genuine reasoning or surface-level JSON pattern matching after training on 12,000 structured cases.
  Authors: We agree the abstract should define MIR-E more explicitly. In the revision we will add: 'MIR-E evaluates accuracy across heir identification, application of blocking rules, and precise fractional share calculations in structured JSON output.' A component-wise breakdown and error analysis already appear in the Results section; we will reference this explicitly from the abstract to demonstrate that the score reflects multi-step legal reasoning rather than format mimicry.
  Revision: yes
- Referee: [Abstract] Abstract: The claim of competitive performance versus Gemini-2.5-flash is unsupported by any numerical baseline, direct comparison table, or statistical test. No results from the shared-task leaderboard or alternative models are reported.
  Authors: The original phrasing intended to highlight resource efficiency rather than direct numerical superiority. Because Gemini-2.5-flash was not part of the official shared-task evaluation, we lack official leaderboard numbers for it. We will revise the abstract to remove the specific Gemini comparison and instead emphasize that the 90% MIR-E score was obtained with a single-GPU QLoRA setup, positioning the result in terms of accessibility for domain-specific legal tasks.
  Revision: yes
- Referee: [Results] No ablation or generalization experiments are described. The two-stage procedure is presented without controls that isolate the contribution of the fatwa pre-adaptation stage or that test performance on unseen combinations of heirs and blocking rules outside the 12,000 training cases.
  Authors: We acknowledge that explicit ablation studies isolating the fatwa adaptation stage and tests on novel heir combinations would strengthen the claims. Due to shared-task time and compute limits we did not run full controls; the two-stage design was selected after small-scale pilots showed improved terminology handling. In revision we will add a dedicated Limitations paragraph explaining this design choice and the absence of ablations, while noting that test-set performance provides indirect evidence of generalization within the task distribution.
  Revision: partial
Circularity Check
No significant circularity in empirical fine-tuning results
The paper reports an empirical multi-stage QLoRA fine-tuning procedure on 3,166 fatwa records followed by 12,000 structured cases, with performance measured by the MIR-E score on a held-out test set. No equations, mathematical derivations, uniqueness theorems, or ansatzes are present. The central claim is an experimental outcome (90% MIR-E) rather than a prediction derived from fitted parameters or self-referential definitions. No self-citations are load-bearing, and the MIR-E metric functions as an external evaluation score rather than a quantity defined in terms of the model's own outputs. The derivation chain is self-contained as standard supervised fine-tuning and evaluation.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank = 128
- Quantization precision = 4-bit NF4
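To see why the quantization precision matters for the "minimal computational resources" claim, a back-of-the-envelope estimate of weight storage can be sketched as below. It assumes a nominal 4 billion parameters (the exact count is not given in the review) and ignores the per-block quantization constants, activations, optimizer state, and the adapters themselves; the helper name is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB, ignoring quantization constants."""
    return n_params * bits_per_param / 8 / 2**30


n_params = 4e9  # nominal "4B" size; the exact count is an assumption

mem_16bit = weight_memory_gib(n_params, 16)
mem_4bit = weight_memory_gib(n_params, 4)

print(f"16-bit weights: {mem_16bit:.2f} GiB")
print(f"4-bit weights:  {mem_4bit:.2f} GiB")
print(f"reduction:      {mem_16bit / mem_4bit:.0f}x")
```

Under these assumptions the frozen base weights shrink from roughly 7.5 GiB to under 2 GiB, which is consistent with the single-GPU setup the authors mention in their rebuttal, though actual training memory will be higher.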