Severity-Aware Weighted Loss for Arabic Medical Text Generation

Ahmed Alansary; Ali Hamdi; Molham Mohamed

arxiv: 2604.06346 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 20:15 UTC · model grok-4.3

keywords: severity-aware loss · Arabic medical text generation · fine-tuning · weighted loss · large language models · clinical risk prioritization

The pith

A severity-weighted loss improves fine-tuning of Arabic language models for medical text generation by emphasizing high-risk cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that uniform treatment of medical cases in standard fine-tuning overlooks differences in clinical severity, raising the risk of errors in critical situations. It introduces a weighted loss that scales each token's contribution according to soft severity probabilities, so the model pays more attention to severe interactions during optimization. This change requires no architectural modifications and is tested on Arabic medical complaint-response pairs. Results show larger and more consistent gains than ordinary cross-entropy fine-tuning across models of different sizes and designs.

Core claim

By deriving soft severity probabilities and using them to dynamically scale token-level loss contributions, severity-aware optimization produces higher-quality generations for Arabic medical complaints than treating every case with equal weight.

What carries the argument

The severity-aware weighted loss, which multiplies each token's loss term by a severity probability to prioritize clinically critical examples.
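A minimal sketch of how such a loss could be computed, assuming the form quoted from the paper further down this page (w = α p_nc + β p_n + γ p_c, with the weighted negative log-likelihood averaged over the L response tokens). The function names, the coefficient values, and the token probabilities are illustrative assumptions, not the authors' settings. The subscripts nc, n, and c are kept as opaque labels for three severity classes because this page does not expand them.

```python
import math


def severity_weight(p_nc, p_n, p_c, alpha, beta, gamma):
    """Scalar weight w = alpha*p_nc + beta*p_n + gamma*p_c for one example."""
    return alpha * p_nc + beta * p_n + gamma * p_c


def severity_aware_loss(token_probs, weight):
    """Mean over tokens of weight * (-log P(y_t | y_<t, X)).

    token_probs: probability the model assigned to each reference token.
    """
    if not token_probs:
        raise ValueError("need at least one target token")
    nll = [-math.log(p) for p in token_probs]
    return weight * sum(nll) / len(nll)


# Hypothetical coefficients and classifier outputs, for illustration only.
ALPHA, BETA, GAMMA = 1.0, 1.5, 2.0
token_probs = [0.5, 0.25, 0.8]

w_mild = severity_weight(0.80, 0.15, 0.05, ALPHA, BETA, GAMMA)
w_severe = severity_weight(0.05, 0.15, 0.80, ALPHA, BETA, GAMMA)

plain_ce = severity_aware_loss(token_probs, 1.0)  # unweighted cross-entropy
loss_mild = severity_aware_loss(token_probs, w_mild)
loss_severe = severity_aware_loss(token_probs, w_severe)
```

With identical token probabilities, the example that the classifier scores as severe contributes a larger loss than the mild one, which is the prioritization effect described above. Because the weight is a single scalar per example, it can be applied without touching the model itself.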

If this is right

Medical response generation improves without any change to model architecture or training data.
Gains appear across models that differ in size and internal design.
The method can be applied at the loss level only, keeping implementation lightweight.
Prioritizing severe cases reduces the chance that critical medical details are under-optimized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar loss weighting could be tested in other high-stakes domains such as legal or safety-critical text generation.
The same idea might combine with additional risk signals beyond severity to create multi-factor prioritization.
Real-world deployment studies could measure whether the metric gains translate into fewer clinically harmful suggestions.

Load-bearing premise

The automatically assigned severity probabilities accurately reflect real clinical risk and that raising loss weight on those cases yields safer or higher-quality outputs rather than merely better scores on the evaluation metric.

What would settle it

Human experts rating generations from the weighted-loss model as no safer or less accurate than those from a standard fine-tuned model on a set of high-severity medical queries.

Figures

Figures reproduced from arXiv: 2604.06346 by Ahmed Alansary, Ali Hamdi, Molham Mohamed.

Original abstract

Large language models have shown strong potential for Arabic medical text generation; however, traditional fine-tuning objectives treat all medical cases uniformly, ignoring differences in clinical severity. This limitation is particularly critical in healthcare settings, where errors in severe cases contain higher clinical risk. In this work, we propose a severity-aware weighted loss for fine-tuning Arabic language models on medical complaint-response data. The method depends on soft severity probabilities to dynamically scale token-level loss contributions during optimization, thereby prioritizing clinically critical interactions without modifying model architectures. Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses. Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier and incorporated exclusively at the loss level. The proposed approach is evaluated across ten Arabic large language models of varying architectures and parameter scales. While standard cross-entropy fine-tuning yields only modest improvements, severity-aware optimization consistently achieves larger gains. Using a balanced weighting configuration, performance improves from 54.04% to 66.14% for AraGPT2-Base, from 59.16% to 67.18% for AraGPT2-Medium, and from 57.83% to 66.86% for Qwen2.5-0.5B, with peak performance reaching 67.18%. Overall, severity-aware fine-tuning delivers improvements of up to 12.10% over non-fine-tuned baselines, demonstrating robust and architecture-consistent gains.
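The gains quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the three before/after pairs given there; the helper name `gains` is ours. The absolute differences come to 12.10, 8.02, and 9.03, so the headline "12.10%" matches the absolute difference for AraGPT2-Base and reads as percentage points rather than a relative improvement.

```python
# (model, score before, score after) as reported in the abstract, in percent.
reported = [
    ("AraGPT2-Base", 54.04, 66.14),
    ("AraGPT2-Medium", 59.16, 67.18),
    ("Qwen2.5-0.5B", 57.83, 66.86),
]


def gains(rows):
    """Return {model: (absolute gain in points, relative gain in %)}."""
    out = {}
    for name, before, after in rows:
        absolute = round(after - before, 2)
        relative = round(100.0 * (after - before) / before, 2)
        out[name] = (absolute, relative)
    return out


result = gains(reported)
for name, (absolute, relative) in result.items():
    print(f"{name}: +{absolute:.2f} points ({relative:.2f}% relative)")
```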

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gets consistent metric gains on Arabic medical generation by weighting loss with AraBERT-derived severity probabilities, but offers no evidence that the classifier tracks actual clinical risk.

The core result here is straightforward: fine-tuning ten Arabic models on the MAQA dataset with a loss scaled by soft severity scores from a separate AraBERT classifier lifts the reported scores by up to 12.10 points over the non-fine-tuned baselines, and by more than standard cross-entropy fine-tuning does. The gains appear across model sizes and architectures, which is the main empirical contribution. That part is cleanly executed and worth noting for anyone doing domain adaptation in Arabic NLP.

The method itself is not conceptually new, since weighted losses and auxiliary classifiers are standard, but applying the combination specifically to medical complaint-response pairs in Arabic is a fresh empirical case. The authors keep the change at the loss level only, a practical choice that avoids architecture changes.

What is missing is any check on whether the severity probabilities are reliable. The abstract gives no accuracy, F1, or calibration numbers for the AraBERT classifier, no inter-annotator agreement with clinicians on the MAQA labels, and no human evaluation of whether the higher-weighted outputs are actually safer or more accurate in clinical terms. Without those, the reported metric lifts could come from any re-weighting scheme rather than from prioritizing real risk. The generation metric itself is also left unspecified, so it is hard to judge what "performance" means here.

The work is aimed at researchers building medical chat tools for Arabic-speaking regions. It is incremental rather than foundational, but the experiments are broad enough that a referee could usefully comment on the missing validation steps and on whether the gains hold under different severity label sources. I would send it for peer review rather than desk reject.
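One way to probe the worry that any re-weighting scheme might produce similar gains is a control run in which the same weights are reassigned to random training examples. The sketch below is our suggestion, not an experiment described in the paper; `shuffled_weights` and the example values are hypothetical.

```python
import random


def shuffled_weights(weights, seed=0):
    """Return the same weights reassigned to random examples.

    The set of weight values (and so the overall loss scale) is preserved,
    but the link between a weight and its example's severity is broken.
    """
    rng = random.Random(seed)
    permuted = list(weights)  # copy so the caller's list is not modified
    rng.shuffle(permuted)
    return permuted


# Hypothetical per-example severity weights.
severity_w = [1.0, 1.1, 1.9, 1.2, 2.0, 1.05]
control_w = shuffled_weights(severity_w, seed=13)
```

If a model trained with `control_w` scored about as well as one trained with `severity_w`, the improvement would be hard to attribute to severity itself.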

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a severity-aware weighted loss for fine-tuning Arabic LLMs on medical complaint-response generation using the MAQA dataset. Soft severity probabilities are obtained from a fine-tuned AraBERT classifier and used to scale the token-level cross-entropy loss during training. The approach is tested on ten models. It reports improvements such as 54.04% to 66.14% for AraGPT2-Base, with gains of up to 12.10% overall, which the authors attribute to prioritizing high-severity cases.

Significance. If the severity classifier accurately reflects clinical risk and the metric improvements correspond to safer medical responses, this method offers a lightweight, architecture-independent way to enhance the reliability of Arabic medical text generation. The consistent gains across model scales are noteworthy, but the significance is limited by the lack of validation for the core weighting mechanism.

major comments (2)

The abstract reports performance improvements (e.g., 54.04% to 66.14% for AraGPT2-Base) without defining the generation metric, providing statistical tests, or detailing baseline comparisons. This is load-bearing for the central claim of robust gains, as the percentages cannot be interpreted without knowing whether they measure response accuracy, safety, or another quantity.
The Methods section (severity classifier description) provides no accuracy, macro-F1, calibration, or human-expert agreement metrics for the AraBERT-derived severity probabilities or labels. Since the weighted loss is constructed directly from these probabilities, the absence of validation means the reported gains may reflect arbitrary re-weighting rather than clinically meaningful prioritization of high-risk cases.
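For the calibration figure requested in the second major comment, one common choice is expected calibration error (ECE) with equal-width confidence bins. The sketch below is a generic implementation under that assumption; the paper does not state which calibration measure, if any, was used, and the example inputs are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up example: four predictions at 0.9 confidence, three of them right.
ece_example = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

A low ECE would indicate that the soft probabilities used as loss weights mean roughly what they say. It would not, on its own, show that the underlying labels match clinical judgment.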

minor comments (2)

The abstract states evaluation across 'ten Arabic large language models of varying architectures and parameter scales' but does not enumerate the models or scales in the provided text.
The balanced weighting configuration is mentioned as yielding the peak results, but the exact weighting formula or hyperparameter values are not shown in the abstract.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: The abstract reports performance improvements (e.g., 54.04% to 66.14% for AraGPT2-Base) without defining the generation metric, providing statistical tests, or detailing baseline comparisons. This is load-bearing for the central claim of robust gains, as the percentages cannot be interpreted without knowing whether they measure response accuracy, safety, or another quantity.

Authors: We agree that the abstract should explicitly define the generation metric and provide additional context for interpretability. In the revised manuscript, we will update the abstract to specify the evaluation metric used for the reported percentages, elaborate on the baseline comparisons (including non-fine-tuned and standard fine-tuning setups), and reference the statistical tests performed to assess the significance of the improvements.
Revision: yes

Referee: The Methods section (severity classifier description) provides no accuracy, macro-F1, calibration, or human-expert agreement metrics for the AraBERT-derived severity probabilities or labels. Since the weighted loss is constructed directly from these probabilities, the absence of validation means the reported gains may reflect arbitrary re-weighting rather than clinically meaningful prioritization of high-risk cases.

Authors: We acknowledge the validity of this concern, as the severity probabilities are central to the proposed loss weighting. We will revise the Methods section to report the accuracy, macro-F1, and calibration metrics for the fine-tuned AraBERT severity classifier on held-out data. Human-expert agreement was not computed in the original study; we will add a discussion of this limitation and note that the consistent gains across ten models of varying scales provide supporting evidence for the approach's effectiveness.
Revision: partial
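The first response refers to statistical tests without naming them. As one illustration of what such a test could look like, the sketch below runs a paired bootstrap over per-example scores from two systems evaluated on the same test items. This is a common choice in NLP evaluation and an assumption on our part, not the authors' stated procedure; the function name and inputs are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system B's total score beats system A's.

    scores_a, scores_b: per-example metric values on the same test items,
    in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_b[i] - scores_a[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples
```

A value close to 1.0 suggests that system B's advantage is stable across resampled test sets, while a value near 0.5 suggests the difference could be noise.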

standing simulated objections not resolved

Human-expert agreement metrics for the severity labels, which would require new expert annotations not performed in the original experiments.
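If expert annotations were collected, agreement between the classifier's labels and a clinician's labels could be summarized with Cohen's kappa, which corrects raw agreement for chance. The sketch below is a generic implementation; the label names and the example data are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equally long label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four complaints.
classifier = ["critical", "critical", "non_critical", "non_critical"]
clinician = ["critical", "non_critical", "non_critical", "non_critical"]
kappa = cohens_kappa(classifier, clinician)
```

Here the two label lists agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5.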

Circularity Check

0 steps flagged

No significant circularity; empirical gains are measured, not derived by construction.

The paper defines a severity-weighted token loss using soft probabilities output by a separately fine-tuned AraBERT classifier on MAQA data, then fine-tunes ten Arabic LLMs and reports measured metric improvements (e.g., 54.04% to 66.14% on AraGPT2-Base). These performance deltas are experimental results on held-out generations; they do not reduce algebraically to the weighting parameters themselves, nor does any equation or self-citation chain make the headline claim tautological. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard LLM fine-tuning assumptions plus the domain-specific premise that severity can be reliably estimated by a classifier and that loss reweighting improves clinical utility.

free parameters (1)

balanced weighting configuration
Specific scaling factors or probability thresholds for severity are selected to achieve the reported gains.

axioms (2)

standard math Cross-entropy is the appropriate base loss for autoregressive text generation
Invoked implicitly as the starting point before weighting.
domain assumption The fine-tuned AraBERT classifier produces soft probabilities that reflect true clinical severity
Used to derive weights but not validated against expert clinical judgment in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: a severity-aware weighted loss ... w = α p_nc + β p_n + γ p_c ... L_SA = (1/L) Σ w · (−log P(y_t | y_<t, X))

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

