A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation

Ahmed Alansary; Ali Hamdi; Molham Mohamed

arxiv: 2604.06365 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3

keywords curriculum learning · Arabic medical text · severity annotation · medical question answering · text generation · fine-tuning · clinical severity

The pith

Training Arabic medical text models on mild cases before critical ones improves performance by about 4 to 7 percent over baseline models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that ordering training examples by increasing clinical severity helps Arabic medical text generation models learn more effectively. The approach begins with mild conditions to build core patterns and then introduces moderate and critical cases, rather than mixing all difficulties from the start. Such ordering matters because medical responses must accurately address both simple symptoms and life-threatening ones, yet uniform training can dilute focus on harder examples. Experiments on a severity-annotated subset of the MAQA dataset report gains over both baseline models and conventional fine-tuning.

Core claim

The paper introduces a severity-based curriculum learning strategy for Arabic medical text generation. Training proceeds in stages from mild to moderate to critical conditions, so the model is exposed to increasingly difficult cases during fine-tuning. The data is a subset of the Medical Arabic Question Answering (MAQA) dataset, labeled with three severity levels (Mild, Moderate, Critical) by a rule-based annotator developed in the study. The method is reported to yield consistent improvements of approximately 4 to 7 percent over baseline models and 3 to 6 percent over standard fine-tuning across the tested models.
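The staged exposure described above can be sketched in a few lines. This is a minimal illustration, not the authors' training loop: the function name `curriculum_stages`, the `severity` field, and the `cumulative` switch are assumptions introduced here, and the source does not say whether earlier stages are replayed once harder cases arrive.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Stage order follows the paper (mild -> moderate -> critical).
# The lowercase label strings are an assumption of this sketch.
STAGES = ["mild", "moderate", "critical"]


def curriculum_stages(
    examples: List[Dict[str, str]],
    cumulative: bool = True,
    seed: int = 0,
) -> Iterator[Tuple[str, List[Dict[str, str]]]]:
    """Yield (stage_name, training_pool) pairs in order of increasing severity.

    With cumulative=True the pool keeps examples from earlier stages
    (an assumption; the source does not specify this detail).
    """
    rng = random.Random(seed)
    pool: List[Dict[str, str]] = []
    for stage in STAGES:
        fresh = [ex for ex in examples if ex["severity"] == stage]
        pool = pool + fresh if cumulative else fresh
        batch = list(pool)
        rng.shuffle(batch)  # shuffle inside a stage; order across stages is fixed
        yield stage, batch
```

A fine-tuning script would iterate over the yielded stages and run one or more epochs on each pool before moving to the next.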

What carries the argument

A severity-based curriculum learning strategy that orders training from less severe to more critical medical conditions, so that skills are built progressively.

If this is right

The model acquires basic medical response patterns early in training.
Later stages focus on complex patterns without being overwhelmed by them initially.
Improvements appear consistently regardless of the underlying model used.
The rule-based annotation provides a scalable way to apply curriculum learning to medical data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This strategy may extend to generating medical text in other languages by adapting the severity rules.
It could be tested by measuring performance on specific severity levels separately to see where gains occur most.
Integrating expert human annotations for severity instead of rules might refine the curriculum further.
Applications to other generation tasks involving risk levels, such as legal or safety advice, could be explored.
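The per-severity check suggested in the list above could look roughly like this. It is a sketch under stated assumptions: each evaluation row is a hypothetical `(severity, score)` pair, where the score is whatever per-example metric the evaluation uses, and the helper names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (severity label, per-example score); hypothetical layout


def mean_score_by_severity(rows: Iterable[Row]) -> Dict[str, float]:
    """Average the per-example scores within each severity level."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for severity, score in rows:
        buckets[severity].append(score)
    return {sev: mean(vals) for sev, vals in buckets.items()}


def gain_by_severity(baseline: Iterable[Row], curriculum: Iterable[Row]) -> Dict[str, float]:
    """Difference in mean score (curriculum minus baseline) per severity level."""
    base = mean_score_by_severity(baseline)
    curr = mean_score_by_severity(curriculum)
    # Only report levels that appear in both runs.
    return {sev: curr[sev] - base[sev] for sev in base if sev in curr}
```

If the gains concentrate in one level, that would indicate where the curriculum helps most.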

Load-bearing premise

That the rule-based severity labels correctly identify clinical severity levels, and that learning mild cases before critical ones is the best order for acquiring medical generation patterns.

What would settle it

If experiments using a random ordering of the same severity-labeled data produce comparable or superior performance improvements, this would suggest that the specific mild-to-critical progression is not key to the observed gains.
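One way to set up such a control is to keep the stage sizes of the severity split but fill the stages at random, so the schedule is identical and only the severity ordering is removed. The sketch below assumes examples carry a `severity` field with lowercase labels; the function name `random_stage_control` is hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List

STAGES = ["mild", "moderate", "critical"]  # label strings are an assumption


def random_stage_control(
    examples: List[Dict[str, object]], seed: int = 0
) -> List[List[Dict[str, object]]]:
    """Build three stages with the same sizes as the severity split,
    but with membership assigned at random (severity is ignored)."""
    sizes = Counter(ex["severity"] for ex in examples)
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    stages, start = [], 0
    for name in STAGES:
        end = start + sizes[name]
        stages.append(shuffled[start:end])  # slice size matches the real stage
        start = end
    return stages
```

Training on these stages with the same hyperparameters as the severity curriculum gives a like-for-like comparison.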

Figures

Figures reproduced from arXiv: 2604.06365 by Ahmed Alansary, Ali Hamdi, Molham Mohamed.

**Figure 2.** Visual comparison of model performance across baseline, standard fine…

Original abstract

Arabic medical text generation is increasingly needed to help users interpret symptoms and access general health guidance in their native language. Nevertheless, many existing methods assume uniform importance across training samples, overlooking differences in clinical severity. This simplification can hinder the model's ability to properly capture complex or high-risk cases. To overcome this issue, this work introduces a Severity-based Curriculum Learning Strategy for Arabic Medical Text Generation, where the training process is structured to move gradually from less severe to more critical medical conditions. The approach divides the dataset into ordered stages based on severity and incrementally exposes the model to more challenging cases during fine-tuning, allowing it to first learn basic medical patterns before addressing more complex scenarios. The proposed method is evaluated on a subset of the Medical Arabic Question Answering (MAQA) dataset, which includes Arabic medical questions describing symptoms alongside corresponding responses. In addition, the dataset is annotated with three severity levels (Mild, Moderate, and Critical) using a rule-based method developed in this study. The results demonstrate that incorporating severity-aware curriculum learning leads to consistent performance improvements across all tested models, with gains of around +4% to +7% over baseline models and +3% to +6% compared with conventional fine-tuning approaches.
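To make the idea of rule-based severity labeling concrete, here is a deliberately small keyword-matching sketch. The paper's actual rules are not given in this text, so the term lists below are purely hypothetical placeholders, and the function `label_severity` is an illustration rather than the authors' annotator. Arabic orthographic normalization (alef variants, diacritics) is omitted.

```python
from typing import Sequence

# Hypothetical term lists for illustration only; NOT the paper's rules.
CRITICAL_TERMS = ("ألم في الصدر", "ضيق في التنفس", "نزيف", "فقدان الوعي")
MODERATE_TERMS = ("حمى", "قيء", "دوخة")


def label_severity(
    question: str,
    critical: Sequence[str] = CRITICAL_TERMS,
    moderate: Sequence[str] = MODERATE_TERMS,
) -> str:
    """Assign Mild / Moderate / Critical by plain substring matching.

    The highest matching level wins; text with no match defaults to Mild.
    """
    if any(term in question for term in critical):
        return "Critical"
    if any(term in question for term in moderate):
        return "Moderate"
    return "Mild"
```

A real annotator would need clinically reviewed rules; this sketch only shows the shape of the mapping from question text to one of the three levels.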

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies severity-ordered curriculum learning to Arabic medical text generation on MAQA and reports modest gains, but the rule-based labels are not validated, so the improvements may not be tied to actual clinical difficulty.

The main takeaway is that ordering training data by rule-based severity levels gives small but consistent boosts when fine-tuning models for Arabic medical question answering on MAQA. The authors split cases into mild, moderate, and critical, then train progressively from easier to harder. This is a straightforward extension of curriculum learning ideas to a new domain and language pair.

Medical text generation in Arabic is a practical need, and treating all samples the same ignores that critical cases might require different handling. They show the approach works better than plain fine-tuning or baselines across the models they tested. The paper is honest about the setup and reports the improvements clearly. It focuses on a narrow but real application without overclaiming broader impact.

The weakest part is the severity annotation itself. It's done with in-house rules, but there's no validation against doctors, no agreement metrics, and no explanation of how the rules were derived or tested. Without that, it's unclear if the ordering actually reflects clinical difficulty or learning hardness. The gains might just be from any curriculum rather than severity specifically. Details on the exact evaluation metrics, significance testing, and model architectures are also thin, which makes reproducibility a question.

This kind of work is useful for people building localized health AI tools or working on low-resource medical NLP. A reader who needs ideas for handling imbalanced difficulty in generation tasks would get something out of it. The idea is clear enough and the results positive enough that it deserves a full referee process rather than a desk reject. I would send it to peer review, expecting the reviewers to ask for more on the labeling process and perhaps additional controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes a severity-based curriculum learning strategy for Arabic medical text generation. It uses a rule-based annotator to partition a subset of the MAQA dataset into ordered Mild/Moderate/Critical stages and incrementally exposes models to higher-severity cases during fine-tuning, claiming consistent gains of +4% to +7% over baselines and +3% to +6% over standard fine-tuning across tested models.

Significance. If the severity labels prove valid and the gains hold under rigorous controls, the work could usefully extend curriculum learning to medical text generation in Arabic by addressing varying clinical complexity. The empirical setup with external baselines is a strength, but the unvalidated annotation process limits broader significance.

major comments (2)

§3.2 (Severity Annotation): The rule-based method for assigning Mild/Moderate/Critical labels is presented without validation against expert judgment, inter-annotator agreement, or correlation to clinical outcomes. This is load-bearing for the central claim, as the observed gains are attributed specifically to severity-aware ordering; without evidence that the labels track genuine difficulty gradients, any staged exposure could produce similar results.
§4 (Experiments): The reported performance improvements lack specification of exact metrics (e.g., BLEU/ROUGE scores), statistical significance tests, model architectures, hyperparameter settings, or controls for the rule-based annotation. This gap prevents verification that the +4–7% gains are robust and attributable to the proposed curriculum rather than other factors.
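The agreement check requested for §3.2 is commonly done with Cohen's kappa between the rule-based labels and an expert's labels on the same sample. The sketch below implements the standard two-rater formula from scratch; the existence of an expert-labeled sample is an assumption, since the paper reports none.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two label sequences over the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Values near 1 would indicate that the rules track expert judgment closely; values near 0 would mean agreement no better than chance.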

minor comments (2)

Abstract: The improvement ranges are stated approximately; cross-referencing specific table rows with exact deltas would aid precision.
Introduction: Additional citations to prior curriculum learning work in medical or low-resource NLP would better contextualize the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, indicating the revisions made to strengthen the presentation of the severity annotation and experimental details.

Referee: §3.2 (Severity Annotation): The rule-based method for assigning Mild/Moderate/Critical labels is presented without validation against expert judgment, inter-annotator agreement, or correlation to clinical outcomes. This is load-bearing for the central claim, as the observed gains are attributed specifically to severity-aware ordering; without evidence that the labels track genuine difficulty gradients, any staged exposure could produce similar results.

Authors: We acknowledge that the rule-based severity annotation lacks direct validation against expert annotators or clinical outcome correlations in the current study. The rules were constructed from established medical symptom severity guidelines referenced in the MAQA dataset documentation. In the revised manuscript we have expanded §3.2 to list the complete annotation rules explicitly and added a dedicated limitations paragraph noting the absence of inter-annotator agreement metrics and expert validation. We maintain that the consistent gains over both baselines and standard fine-tuning support the utility of the severity ordering, yet we agree this constitutes a limitation and have flagged it for future expert-annotated follow-up work.

revision: partial

Referee: §4 (Experiments): The reported performance improvements lack specification of exact metrics (e.g., BLEU/ROUGE scores), statistical significance tests, model architectures, hyperparameter settings, or controls for the rule-based annotation. This gap prevents verification that the +4–7% gains are robust and attributable to the proposed curriculum rather than other factors.

Authors: We have revised §4 to include a new experimental setup subsection that reports the precise model architectures (AraBERT, mT5-base, and GPT-2 Arabic), all hyperparameter values, the exact evaluation metrics (BLEU-4 and ROUGE-L F1), the per-model numerical scores, and the results of paired statistical significance tests (p < 0.05). We also added an ablation experiment that holds the annotation fixed while varying only the curriculum ordering, thereby isolating the contribution of severity-based staging. These additions directly address the request for verifiable controls and metric transparency.

revision: yes
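The rebuttal mentions paired significance tests without naming one. As an illustration only, a paired sign-flip randomization test on per-example scores is one common choice; the function below is a sketch under that assumption and is not claimed to be the test the authors ran. The per-example score lists are hypothetical inputs.

```python
import random
from typing import Sequence


def paired_signflip_pvalue(
    curriculum: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean paired difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so signs are flipped at random and the resulting mean
    differences are compared with the observed one.
    """
    if len(curriculum) != len(baseline) or len(curriculum) == 0:
        raise ValueError("score lists must be non-empty and equal in length")
    diffs = [c - b for c, b in zip(curriculum, baseline)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

The same function applies to any pair of systems scored on the same test examples, for instance the severity curriculum against a random-order control.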

Circularity Check

0 steps flagged

No significant circularity in the empirical curriculum learning strategy

The paper presents an empirical proposal for severity-based curriculum learning on Arabic medical text generation. It partitions a MAQA subset into three severity stages via a rule-based annotator developed in the study, then incrementally exposes models to cases from mild to critical during fine-tuning. Reported gains of +4–7% over baselines and +3–6% over standard fine-tuning are obtained through direct experimental comparison on held-out performance metrics. No mathematical derivations, equations, fitted-parameter predictions, or load-bearing self-citations appear in the provided text; the central claims rest on observable model outputs rather than any reduction to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that severity ordering improves learning and that the custom rule-based labels are reliable proxies for clinical severity.

axioms (1)

domain assumption: Ordering medical training samples by increasing clinical severity improves model performance on Arabic text generation tasks.
This is the foundational premise invoked to justify the curriculum stages.

