
arxiv: 2606.21460 · v1 · pith:LI6WNUFMnew · submitted 2026-06-19 · 💻 cs.CL · cs.AI

Evaluation of Small Language Models for Arabic Language Processing

Pith reviewed 2026-06-26 14:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Arabic NLP · Small Language Models · Benchmark Evaluation · LLM-as-a-Judge · Zero-shot Testing · Model Alignment · Arabic Instruction Following · Language Model Performance

The pith

Gemma 3 (12B) scores highest among twelve small language models evaluated on Arabic tasks, and the results suggest that alignment and instruction following matter more than raw size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests twelve small language models on Arabic comprehension and generation tasks using a new set of 240 items spread across eight domains and ten skills. All evaluations run in zero-shot mode with Arabic-only prompts, and responses are scored by three separate LLM judges whose results are aggregated. Gemma 3 (12B) records the top mark of 4.548 out of 5. Models with stronger Arabic alignment and more reliable instruction adherence tend to outperform the others, and parameter count alone does not explain the ranking. Lower-scoring models commonly fail through prompt leakage, hallucination, language drift, incomplete generation, or weak task adherence. The benchmark therefore supplies a practical reference for measuring progress on compact, culturally suitable Arabic systems.

Core claim

By running a controlled zero-shot benchmark on 240 Arabic items and scoring outputs with a multi-model LLM-as-a-judge panel, the evaluation shows Gemma 3 (12B) achieving the highest aggregate score of 4.548/5. Performance does not track model size alone; models that exhibit stronger Arabic alignment and more consistent instruction following produce better results across both comprehension and generation tasks. Common failure modes in weaker models include prompt leakage, hallucination, language drift, and weak task adherence.

What carries the argument

A zero-shot Arabic-only benchmark of 240 items across eight domains and ten skills, scored by an aggregated multi-model LLM-as-a-judge panel.
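
As a rough illustration of what aggregating a judge panel could look like, the sketch below averages the three judge scores for each item and then averages across items for each model. The abstract only says scores were aggregated across judges, so the plain mean is an assumption, and the model names, item ids, and scores are hypothetical rather than data from the paper.

```python
from statistics import mean

# Hypothetical judge scores on a scale up to 5; NOT data from the paper.
# scores[model][item_id] holds one score per judge (three judges).
scores = {
    "model_a": {"q1": [5, 4, 5], "q2": [4, 4, 3]},
    "model_b": {"q1": [3, 2, 3], "q2": [4, 3, 4]},
}


def aggregate(per_item):
    """Average across judges for each item, then across items.

    Assumption: an unweighted mean at both levels. The paper may use a
    different aggregation rule.
    """
    item_means = [mean(judge_scores) for judge_scores in per_item.values()]
    return mean(item_means)


# Rank models from highest to lowest aggregate score.
ranking = sorted(scores, key=lambda m: aggregate(scores[m]), reverse=True)

for model in ranking:
    print(f"{model}: {aggregate(scores[model]):.3f}")
```

Averaging per item first keeps every item equally weighted even if a judge occasionally fails to return a score for some item, which is one reason to prefer it over pooling all raw scores.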

If this is right

  • Future Arabic SLM development should prioritize explicit alignment and instruction-tuning data over simply increasing parameter count.
  • The benchmark can serve as a repeatable reference for comparing new compact Arabic models on both understanding and generation.
  • Models that avoid common failure patterns such as language drift and incomplete generation will show measurable gains on the same test set.
  • Compact models with reliable Arabic instruction following can reach high practical performance without matching the size of larger general models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks could be built for other low-resource languages to test whether alignment effects dominate size there as well.
  • The identified failure patterns suggest targeted data collection focused on prompt adherence and language consistency could lift the lower-performing models.
  • If the judge framework proves stable, it offers a low-cost way to iterate on new Arabic SLMs before larger human evaluations.

Load-bearing premise

The three LLM judges produce consistent and unbiased scores that accurately reflect Arabic language quality.

What would settle it

Human Arabic speakers independently rating the same model outputs produce a different ranking of the twelve models than the aggregated LLM-judge scores.
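
One way to make this check concrete is a rank correlation between the LLM-judge ordering and a human ordering of the same models. The sketch below computes Spearman's rho in plain Python as the Pearson correlation of ranks. The score lists are hypothetical placeholders, not results from the paper, and the helper names are illustrative.

```python
def ranks(values):
    """Return 1-based ranks (highest value gets rank 1), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model mean scores, listed in the same model order.
llm_judge_scores = [4.5, 4.1, 3.2, 2.8]
human_scores = [4.4, 4.2, 3.0, 3.1]

rho = spearman(llm_judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rho close to 1 would mean the human and LLM-judge rankings largely agree. A low or negative value would be the kind of evidence described above.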

Original abstract

This paper evaluates the performance of twelve Small Language Models (SLMs) on Arabic natural language processing tasks. The study introduces a benchmark of 240 Arabic test items distributed across eight domains and ten language skills, covering both comprehension-oriented and generation-oriented tasks. All models were evaluated under a controlled zero-shot setting using a standardized Arabic-only prompt template. Model responses were assessed through a multi-model LLM-as-a-judge framework involving GPT-4.1 Mini, Claude Haiku 4.5, and DeepSeek-Chat, with scores aggregated across judges and analyzed by task, skill, and model family. The results show that Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic. The observed results suggest that model size alone does not explain Arabic SLM performance. Models with stronger Arabic alignment and more reliable instruction-following behavior tended to perform better across tasks. Common failure patterns among lower-performing models include prompt leakage, hallucination, language drift, incomplete generation, and weak task adherence. Overall, the benchmark provides a structured reference for evaluating compact Arabic language models and supports future work on efficient, reliable, and culturally appropriate Arabic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates twelve small language models on Arabic NLP tasks via a new benchmark of 240 items spanning eight domains and ten skills. All evaluations use controlled zero-shot prompting with an Arabic-only template; responses are scored by an aggregated multi-LLM judge (GPT-4.1 Mini, Claude Haiku 4.5, DeepSeek-Chat). Gemma 3 (12B) obtains the highest mean score (4.548/5), followed by Aya and C4AI Command Arabic; the authors conclude that Arabic alignment and instruction-following matter more than parameter count, and they catalog common failure modes such as prompt leakage and language drift.

Significance. If the scores prove reliable, the work supplies a compact, publicly usable Arabic benchmark focused on SLMs and surfaces practical failure patterns relevant to culturally appropriate deployment. The empirical comparison across model families adds a data point to the growing literature on non-English SLM evaluation.

major comments (2)
  1. [Abstract and Evaluation Framework] The central ranking (Gemma 3 highest) and the claim that 'stronger Arabic alignment and more reliable instruction-following behavior tended to perform better' rest entirely on aggregated LLM-as-a-judge scores. No calibration procedure, inter-judge agreement statistic (e.g., Fleiss' kappa), or human correlation on the 240 Arabic items is reported. Without these anchors, the numerical ordering and the alignment-versus-size conclusion cannot be verified.
  2. [Results section] Model-wise scores are presented without statistical significance tests, confidence intervals, or multiple-comparison correction. Consequently the statement that size alone does not explain performance lacks a quantitative basis for distinguishing signal from noise across the ten skills.
minor comments (1)
  1. [Abstract] The abstract states 'ten language skills' but does not enumerate them; a short list or pointer to the methods table would improve readability.
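
The inter-judge agreement statistic named in major comment 1 can be sketched as follows. This is a minimal plain-Python version of Fleiss' kappa, assuming each judge returns an integer score from 0 to 5 that is treated as a nominal category. The ratings are hypothetical, and the paper's actual score format is not confirmed here.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of per-item lists, one label per rater.
    categories: iterable of all possible labels.
    Note: undefined (division by zero) if every rating falls in a
    single category, because expected agreement is then 1.
    """
    categories = list(categories)
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs per item, averaged.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical scores from three judges on four items; NOT paper data.
example_ratings = [[5, 5, 4], [3, 3, 3], [4, 5, 4], [2, 3, 2]]
kappa = fleiss_kappa(example_ratings, range(6))
print(f"Fleiss' kappa = {kappa:.3f}")
```

Because Fleiss' kappa ignores the ordering of the scale, a 4-versus-5 disagreement counts the same as a 0-versus-5 disagreement. An ordinal-aware statistic may be a better fit for graded scores, but that choice is outside what the report specifies.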

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of evaluation reliability and statistical rigor. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.

Point-by-point responses
  1. Referee: [Abstract and Evaluation Framework] The central ranking (Gemma 3 highest) and the claim that 'stronger Arabic alignment and more reliable instruction-following behavior tended to perform better' rest entirely on aggregated LLM-as-a-judge scores. No calibration procedure, inter-judge agreement statistic (e.g., Fleiss' kappa), or human correlation on the 240 Arabic items is reported. Without these anchors, the numerical ordering and the alignment-versus-size conclusion cannot be verified.

    Authors: We agree that explicit validation metrics for the LLM-as-a-judge framework would increase confidence in the reported rankings. In the revised manuscript we will add Fleiss' kappa computed across the three judges on all 240 items to quantify inter-judge agreement. We will also perform a human correlation study on a stratified subset of 50 items (balanced across domains and skills) scored by two native Arabic-speaking annotators, reporting Pearson and Spearman correlations with the aggregated LLM scores. These additions directly address the request for calibration anchors while preserving the zero-shot Arabic-only evaluation protocol. revision: yes

  2. Referee: [Results section] Model-wise scores are presented without statistical significance tests, confidence intervals, or multiple-comparison correction. Consequently the statement that size alone does not explain performance lacks a quantitative basis for distinguishing signal from noise across the ten skills.

    Authors: We acknowledge that the current results section would benefit from formal statistical support. In the revision we will report 95% bootstrap confidence intervals (1,000 resamples) for each model's mean score and per-skill scores. We will also conduct paired Wilcoxon signed-rank tests between all model pairs, applying Bonferroni correction for the ten skills, and present the resulting p-values alongside the mean scores. These tests will provide a quantitative basis for the claim that Arabic alignment and instruction-following explain more variance than parameter count alone. revision: yes
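
The statistical additions promised in the second response can be sketched roughly as below: a percentile bootstrap confidence interval for a model's mean score and a Bonferroni adjustment of p-values. This is a minimal sketch under stated assumptions. The item scores and p-values are hypothetical, the function names are illustrative, and the p-values would in practice come from a paired test such as `scipy.stats.wilcoxon`, which is not run here.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Resamples items with replacement and takes the alpha/2 and
    1 - alpha/2 percentiles of the resampled means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def bonferroni(p_values, n_tests=None):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = n_tests or len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-item aggregated scores for one model; NOT paper data.
item_scores = [4.7, 4.3, 5.0, 3.7, 4.0, 4.7, 4.3, 3.3, 5.0, 4.0]
low, high = bootstrap_ci(item_scores)
print(f"mean = {mean(item_scores):.3f}, 95% CI = ({low:.3f}, {high:.3f})")

# Hypothetical raw p-values, corrected for ten skills as in the response.
adjusted = bonferroni([0.004, 0.03, 0.2], n_tests=10)
print(adjusted)
```

Bonferroni is conservative. If all model pairs are also compared, the number of tests grows beyond ten, so the correction factor would need to reflect the full family of comparisons.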

Circularity Check

0 steps flagged

Empirical evaluation with external judges; no derivation chain or fitted inputs

full rationale

The paper is a straightforward empirical benchmark study reporting scores from 12 SLMs on 240 Arabic items judged by external models (GPT-4.1 Mini, Claude Haiku 4.5, DeepSeek-Chat). No equations, parameter fitting, self-citations, or uniqueness theorems appear in the abstract or described methodology. All claims (e.g., Gemma 3 highest at 4.548/5, alignment > size) are direct aggregates of observed scores rather than reductions to the paper's own inputs. This is the most common honest non-finding for evaluation papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a pure empirical benchmarking study. No free parameters are fitted, no new axioms are introduced beyond standard assumptions of LLM evaluation, and no new entities are postulated.

pith-pipeline@v0.9.1-grok · 5777 in / 1161 out tokens · 14046 ms · 2026-06-26T14:20:51.867385+00:00 · methodology

