
arxiv: 2604.16396 · v1 · submitted 2026-03-29 · 💻 cs.CL

QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning

Pith reviewed 2026-05-14 22:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords QLoRA fine-tuning · Arabic NLP · Islamic inheritance · legal reasoning · small language models · structured output · domain adaptation · Qwen3

The pith

Multi-stage QLoRA fine-tuning lets a 4B model reach a 90% MIR-E score on Arabic Islamic inheritance reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a small language model can perform multi-step legal reasoning in Islamic inheritance law by first adapting to domain-specific fatwa texts and then training on structured cases to produce precise JSON outputs. This two-stage process uses quantized low-rank adaptation to keep resource demands low while remaining competitive with much larger commercial systems. A sympathetic reader would care because it suggests that specialized fine-tuning can make complex rule-based and fractional calculations feasible on accessible hardware rather than requiring massive infrastructure.

Core claim

Domain adaptation on 3,166 Islamic fatwa records, followed by task-specific training on 12,000 structured inheritance cases, using 4-bit NF4 quantization and rank-128 LoRA adapters on Qwen3-4B, yields a 90% MIR-E score on the test set and enables structured legal reasoning that is competitive with systems such as Gemini-2.5-flash.

What carries the argument

Multi-stage QLoRA fine-tuning: initial domain adaptation on fatwas to acquire terminology and patterns, followed by task-specific training on structured cases to optimize JSON-formatted output.
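
To make the adapter size concrete, here is a minimal sketch of how LoRA's trainable parameter count scales with rank for a single weight matrix. The layer width below is a hypothetical placeholder, not a shape taken from Qwen3-4B, and the abstract does not say which modules were adapted.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns a low-rank update B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Hypothetical square projection; NOT taken from the paper or the model card.
D_MODEL = 2048
RANK = 128  # the rank reported in the paper

full = D_MODEL * D_MODEL
adapter = lora_param_count(D_MODEL, D_MODEL, RANK)
print(f"full matrix:      {full:,} params")
print(f"rank-{RANK} adapter: {adapter:,} params ({adapter / full:.1%} of full)")
```

Under these assumed dimensions the adapter holds one eighth as many parameters as the frozen matrix it modifies. Rank 128 is large by LoRA standards, so the savings here come mainly from keeping the base weights frozen and quantized.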

If this is right

  • Small models can execute multi-step fractional calculations and blocking decisions required by inheritance law.
  • Domain pre-adaptation followed by structured output training produces competitive results against larger systems.
  • Quantized low-rank adaptation keeps computational costs low while maintaining high accuracy on jurisprudential tasks.
  • JSON-formatted legal outputs become reliable after the second training stage.
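
As a rough illustration of what "fractional calculations and blocking decisions" involve, the sketch below works one deliberately narrow case with exact fractions: a deceased woman survived by a husband, mother, sons, daughters, and full brothers, with at least one son. The function name and the JSON layout are hypothetical; this is neither the paper's code nor the shared task's output schema, and it ignores the many cases real inheritance law must handle.

```python
from fractions import Fraction
import json


def toy_shares(heirs: dict) -> dict:
    """Per-person shares for one narrow family shape (illustrative only).

    Assumed rules, valid only when at least one son survives:
      - husband takes 1/4 and mother takes 1/6 (the deceased has children)
      - full brothers are blocked by the son and take nothing
      - sons and daughters split the residue, each son counting double
    """
    if heirs.get("son", 0) < 1:
        raise ValueError("this toy sketch only covers cases with at least one son")

    shares = {}
    if heirs.get("husband"):
        shares["husband"] = Fraction(1, 4)
    if heirs.get("mother"):
        shares["mother"] = Fraction(1, 6)
    if heirs.get("full_brother"):
        shares["full_brother"] = Fraction(0)  # blocked by the son

    residue = 1 - sum(shares.values())
    units = 2 * heirs["son"] + heirs.get("daughter", 0)
    shares["son"] = residue * Fraction(2, units)
    if heirs.get("daughter"):
        shares["daughter"] = residue * Fraction(1, units)
    return shares


case = {"husband": 1, "mother": 1, "son": 1, "daughter": 1, "full_brother": 1}
result = toy_shares(case)
# Hypothetical JSON layout: heir type mapped to its per-person share as a string.
print(json.dumps({heir: str(share) for heir, share in result.items()}, indent=2))
```

For this case the residue is 7/12, so the son receives 7/18 and the daughter 7/36. Exact rational arithmetic matters here because rounding to floats would make share totals drift away from 1.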

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage pattern could transfer to other rule-heavy Arabic domains such as contract law or religious rulings.
  • On-device deployment becomes realistic for users needing quick inheritance calculations without cloud access.
  • Generalization tests on unseen schools of Islamic jurisprudence would clarify the limits of the current adaptation.

Load-bearing premise

The 3,166 fatwa records and 12,000 structured cases represent real-world queries well enough that the MIR-E metric measures actual legal reasoning rather than surface pattern matching.

What would settle it

Run the model on a fresh collection of inheritance problems that combine rules and heir configurations absent from both the fatwa and structured training sets, then check whether MIR-E accuracy drops well below 90%.
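
One way to set up such a check is to bucket evaluation cases by whether their heir configuration ever appeared in training, then score each bucket separately. The sketch below assumes each case is a dict with a hypothetical `heirs` mapping and a precomputed `correct` flag; the records are invented placeholders, not data from the paper, and the signature captures heir configurations only, not which rules a case exercises.

```python
def config_signature(case: dict) -> frozenset:
    """Order-independent signature: which heir types appear, and how many."""
    return frozenset(case["heirs"].items())


def split_by_novelty(train_cases: list, eval_cases: list) -> tuple:
    """Partition eval cases by whether their heir configuration was seen in training."""
    seen = {config_signature(c) for c in train_cases}
    familiar = [c for c in eval_cases if config_signature(c) in seen]
    novel = [c for c in eval_cases if config_signature(c) not in seen]
    return familiar, novel


def accuracy(cases: list):
    """Fraction of cases marked correct, or None for an empty bucket."""
    if not cases:
        return None
    return sum(c["correct"] for c in cases) / len(cases)


# Toy placeholder records, invented for illustration only.
train = [{"heirs": {"husband": 1, "son": 1}}]
evals = [
    {"heirs": {"son": 1, "husband": 1}, "correct": True},    # same configuration as training
    {"heirs": {"husband": 1, "son": 2}, "correct": False},   # different count, so novel
    {"heirs": {"wife": 1, "daughter": 1}, "correct": True},  # different heirs, so novel
]
familiar, novel = split_by_novelty(train, evals)
print("familiar:", accuracy(familiar), "novel:", accuracy(novel))
```

A large gap between the two buckets would point toward memorized configurations, while similar scores would support the reading that the model applies the rules.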

Original abstract

Islamic inheritance law (ʿilm al-mawārīth) presents a challenging domain for evaluating large language models' structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. Our approach employs a multi-stage Quantized Low-Rank Adaptation (QLoRA) fine-tuning strategy on Qwen3-4B: (1) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by (2) task-specific training on 12,000 structured inheritance cases to optimize JSON-formatted output generation. Using 4-bit NF4 quantization with rank-128 LoRA adapters, our model achieves 90% MIR-E (Mawarith Inheritance Reasoning Evaluation) score on the test set, demonstrating competitive performance while requiring minimal computational resources. Our results show that domain-specific pre-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively comparing to commercial systems such as Gemini-2.5-flash.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript describes QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. It applies a two-stage QLoRA fine-tuning procedure to Qwen3-4B: domain adaptation on 3,166 Islamic fatwa records followed by task-specific training on 12,000 structured inheritance cases using 4-bit NF4 quantization and rank-128 adapters. The model is reported to reach 90% on the MIR-E metric on the test set and is positioned as competitive with Gemini-2.5-flash while using minimal computational resources.

Significance. If the 90% MIR-E score reflects verified multi-step legal reasoning (including blocking rules and fractional share calculations) rather than output-format mimicry, the work would demonstrate that small quantized models can perform complex domain-specific reasoning with limited resources. This could support broader adoption of efficient fine-tuning pipelines for specialized legal and regulatory tasks.

major comments (3)
  1. [Abstract] Abstract: The MIR-E metric is referenced only at a high level with no definition, component breakdown (e.g., accuracy on heir identification, blocking conditions, or fractional arithmetic), or error analysis. Without these details it is impossible to determine whether the 90% score measures genuine reasoning or surface-level JSON pattern matching after training on 12,000 structured cases.
  2. [Abstract] Abstract: The claim of competitive performance versus Gemini-2.5-flash is unsupported by any numerical baseline, direct comparison table, or statistical test. No results from the shared-task leaderboard or alternative models are reported.
  3. [Results] No ablation or generalization experiments are described. The two-stage procedure is presented without controls that isolate the contribution of the fatwa pre-adaptation stage or that test performance on unseen combinations of heirs and blocking rules outside the 12,000 training cases.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'comparing to commercial systems' should read 'compared to commercial systems'.
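
The component breakdown requested in the first major comment could look something like the sketch below, which scores heir identification, blocking, and fractional shares separately. This is a hypothetical scorer over an assumed JSON layout (`shares` and `blocked` keys), not the definition of MIR-E, which the abstract does not give.

```python
from fractions import Fraction


def component_scores(pred: dict, gold: dict) -> dict:
    """Score one prediction on three separate components (hypothetical scheme)."""

    def as_fractions(shares: dict) -> dict:
        # Parse "14/36"-style strings so equivalent fractions compare equal.
        return {heir: Fraction(value) for heir, value in shares.items()}

    return {
        "heir_identification": set(pred["shares"]) == set(gold["shares"]),
        "blocking": set(pred["blocked"]) == set(gold["blocked"]),
        "fractional_shares": as_fractions(pred["shares"]) == as_fractions(gold["shares"]),
    }


# Invented example records: the prediction has correct shares (one unreduced)
# but fails to report the blocked heir.
gold = {
    "shares": {"husband": "1/4", "mother": "1/6", "son": "7/18", "daughter": "7/36"},
    "blocked": ["full_brother"],
}
pred = {
    "shares": {"husband": "1/4", "mother": "1/6", "son": "14/36", "daughter": "7/36"},
    "blocked": [],
}
scores = component_scores(pred, gold)
print(scores)
```

Reporting the three components separately would show whether a headline score is driven by correct arithmetic, correct blocking decisions, or merely well-formed output. Note that `Fraction` raises `ValueError` on malformed strings, so a real scorer would need to treat unparseable shares as wrong rather than crash.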

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our submission to the QIAS 2026 shared task. We address each major comment point by point below. Where revisions are feasible, we will update the manuscript to improve clarity and rigor while remaining faithful to the experiments performed.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The MIR-E metric is referenced only at a high level with no definition, component breakdown (e.g., accuracy on heir identification, blocking conditions, or fractional arithmetic), or error analysis. Without these details it is impossible to determine whether the 90% score measures genuine reasoning or surface-level JSON pattern matching after training on 12,000 structured cases.

    Authors: We agree the abstract should define MIR-E more explicitly. In the revision we will add: 'MIR-E evaluates accuracy across heir identification, application of blocking rules, and precise fractional share calculations in structured JSON output.' A component-wise breakdown and error analysis already appear in the Results section; we will reference this explicitly from the abstract to demonstrate that the score reflects multi-step legal reasoning rather than format mimicry. revision: yes

  2. Referee: [Abstract] Abstract: The claim of competitive performance versus Gemini-2.5-flash is unsupported by any numerical baseline, direct comparison table, or statistical test. No results from the shared-task leaderboard or alternative models are reported.

    Authors: The original phrasing intended to highlight resource efficiency rather than direct numerical superiority. Because Gemini-2.5-flash was not part of the official shared-task evaluation, we lack official leaderboard numbers for it. We will revise the abstract to remove the specific Gemini comparison and instead emphasize that the 90% MIR-E score was obtained with a single-GPU QLoRA setup, positioning the result in terms of accessibility for domain-specific legal tasks. revision: yes

  3. Referee: [Results] No ablation or generalization experiments are described. The two-stage procedure is presented without controls that isolate the contribution of the fatwa pre-adaptation stage or that test performance on unseen combinations of heirs and blocking rules outside the 12,000 training cases.

    Authors: We acknowledge that explicit ablation studies isolating the fatwa adaptation stage and tests on novel heir combinations would strengthen the claims. Due to shared-task time and compute limits we did not run full controls; the two-stage design was selected after small-scale pilots showed improved terminology handling. In revision we will add a dedicated Limitations paragraph explaining this design choice and the absence of ablations, while noting that test-set performance provides indirect evidence of generalization within the task distribution. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical fine-tuning results

Full rationale

The paper reports an empirical multi-stage QLoRA fine-tuning procedure on 3,166 fatwa records followed by 12,000 structured cases, with performance measured by the MIR-E score on a held-out test set. No equations, mathematical derivations, uniqueness theorems, or ansatzes are present. The central claim is an experimental outcome (90% MIR-E) rather than a prediction derived from fitted parameters or self-referential definitions. No self-citations are load-bearing, and the MIR-E metric functions as an external evaluation score rather than a quantity defined in terms of the model's own outputs. The derivation chain is self-contained as standard supervised fine-tuning and evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard supervised fine-tuning assumptions and the representativeness of the provided training corpora; no new entities or axioms are introduced.

free parameters (2)
  • LoRA rank = 128
    Set to 128 for the adapters in both stages.
  • Quantization precision = 4-bit NF4
    Fixed at 4-bit NF4.
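
For a sense of why the quantization setting matters, the following back-of-envelope sketch compares weight storage at 16 bits and 4 bits per parameter. The parameter count is a nominal 4 billion assumed from the model's name, and the estimate deliberately ignores quantization constants, adapter weights, activations, and optimizer state.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Storage for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 4_000_000_000  # nominal figure assumed from the "4B" name, not an exact count

for label, bits in [("16-bit", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions the base weights shrink by a factor of four, from roughly 7.5 GiB to under 2 GiB, which is what makes single-GPU fine-tuning plausible for a model of this size.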

pith-pipeline@v0.9.0 · 5503 in / 1057 out tokens · 36200 ms · 2026-05-14T22:11:11.421348+00:00 · methodology

