Instruction-Guided Poetry Generation in Arabic and Its Dialects
Pith reviewed 2026-05-07 05:30 UTC · model grok-4.3
The pith
A new instruction dataset allows fine-tuned language models to generate Arabic poetry that follows user instructions on style, rhyme, and other criteria.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a carefully curated instruction-based dataset for Arabic poetry enables fine-tuned large language models to write, revise, continue, and analyze poems in line with user requirements such as style and rhyme. The paper supports this claim with automated metrics and with human evaluation by native speakers.
What carries the argument
The instruction-based dataset covering poetry writing, revision, continuation, and analysis tasks in Modern Standard Arabic and various dialects.
If this is right
- Fine-tuned models can help users create and edit Arabic poetry while respecting specific constraints such as rhyme scheme and stylistic requirements.
- The same approach applies to both Modern Standard Arabic and multiple dialects.
- Human evaluations by native speakers provide evidence that the generated poems meet the requested criteria.
- Public availability of the dataset and code supports further development of controllable generation tools for Arabic.
Where Pith is reading between the lines
- Similar instruction datasets could be built for other languages with rich poetic traditions to improve creative writing support in those languages.
- Creative writing applications for Arabic speakers might incorporate these models to lower barriers to composing poetry.
- The work opens a path for testing whether instruction tuning can preserve cultural forms of expression inside general-purpose language models.
- Extending the dataset with audio or performance instructions could connect generated text to spoken or musical poetry traditions.
Load-bearing premise
The premise has two parts. First, the curated dataset has enough scale, quality, and dialect coverage that fine-tuning on it produces genuine alignment with instructions. Second, ratings from native speakers reliably measure poetic quality and instruction adherence.
What would settle it
A blind evaluation in which native Arabic speakers rate poems from the fine-tuned models as no better aligned with the given instructions, or no more poetically successful, than poems from a baseline model without fine-tuning on the same prompts.
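One way such a comparison could be analyzed is a permutation test on the blind ratings. The sketch below is illustrative only: the function name `permutation_test`, the 1-5 rating scale, and the sample ratings are assumptions, not details taken from the paper.

```python
import random
from statistics import mean


def permutation_test(finetuned, baseline, n_perm=10000, seed=0):
    """One-sided permutation test for mean(finetuned) > mean(baseline).

    Returns the observed mean difference and an estimated p-value.
    A large p-value would be consistent with the fine-tuned model being
    rated no better than the baseline.
    """
    rng = random.Random(seed)
    observed = mean(finetuned) - mean(baseline)
    pooled = list(finetuned) + list(baseline)
    k = len(finetuned)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Randomly reassign ratings to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical instruction-adherence ratings on the same prompts.
finetuned_ratings = [5, 5, 4, 5, 4, 5, 4, 5]
baseline_ratings = [2, 3, 2, 1, 3, 2, 2, 3]
difference, p = permutation_test(finetuned_ratings, baseline_ratings)
```

A permutation test is used here because it makes no distributional assumption about ordinal ratings; a real study would also need to account for repeated ratings by the same annotator.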
Original abstract
Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g., rhyme schemes and titles. In contrast, our work addresses the practical aspect of poetry creation in Arabic by introducing controllable generation capabilities to assist users in writing poetry. Specifically, we present a large-scale, carefully curated instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects. This dataset enables tasks such as writing, revising, and continuing poems based on predefined criteria, including style and rhyme, as well as performing poetry analysis. Our experiments show that fine-tuning LLMs on this dataset yields models that can effectively generate poetry that is aligned with user requirements, based on both automated metrics and human evaluation with native Arabic speakers. The data and the code are available at https://github.com/mbzuai-nlp/instructpoet-ar
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a large-scale, curated instruction-based dataset for Arabic poetry in Modern Standard Arabic (MSA) and dialects, supporting tasks such as poem writing, revision, continuation, and analysis conditioned on criteria including style and rhyme. The central claim is that fine-tuning LLMs on this dataset produces models capable of generating poetry aligned with user requirements, as evidenced by positive results on automated metrics and human evaluations conducted with native Arabic speakers. The dataset and code are released publicly.
Significance. If the empirical claims hold with adequate documentation, the work would be significant for advancing controllable creative generation in Arabic, a culturally central but resource-scarce domain with substantial dialectal variation. Prior Arabic poetry NLP has focused on analysis tasks; this shifts emphasis to generation and provides an open instruction dataset that could benchmark future efforts in low-resource multilingual creative NLP. The public release of data and code is a clear strength supporting reproducibility.
major comments (3)
- §4 (Experiments and Evaluation): The automated metrics used to assess alignment with user instructions (style, rhyme, dialect) are not defined or justified. It is unclear whether custom metrics (e.g., rhyme-scheme accuracy or dialect consistency) or only generic ones (e.g., perplexity, BLEU) were applied. This detail is load-bearing because the claim of 'effective' generation rests directly on these positive metric outcomes.
- §5 (Human Evaluation): The human evaluation protocol is underspecified. No information is given on the number of native-speaker annotators, scoring rubric for poetic quality and instruction adherence, sample size of evaluated outputs, blinding procedures, or inter-annotator agreement. Without these, the reliability of the 'positive' human judgments cannot be assessed and the central effectiveness claim is weakened.
- §3 (Dataset): While described as 'large-scale' and 'carefully curated,' the manuscript provides insufficient quantitative details on total instruction count, task/dialect distribution, and quality-control steps (e.g., verification of dialectal authenticity). These statistics are necessary to evaluate whether the dataset scale and coverage suffice to support the reported alignment results after fine-tuning.
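The inter-annotator agreement requested in the §5 comment can be illustrated with Cohen's kappa for two annotators. This is a minimal sketch under assumptions: the review does not say which agreement statistic the paper uses, and the function name `cohens_kappa` and the sample scores are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must rate the same non-empty item set")
    n = len(ratings_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical 1-5 scores from two annotators on the same six poems.
annotator_1 = [5, 4, 4, 3, 5, 2]
annotator_2 = [5, 4, 3, 3, 5, 2]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement; for ordinal scales a weighted variant, or a multi-annotator statistic, may be more appropriate.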
minor comments (2)
- [Abstract and §1] The abstract and introduction could more explicitly list the exact automated metrics and human-evaluation design choices to allow readers to gauge result strength without reading the full experimental section.
- [Tables/Figures] Figure or table captions for any dataset statistics or evaluation results should include precise definitions of all reported scores.
Simulated Authors' Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for improving clarity and rigor. We address each major comment point by point below. All requested details can be added from our experimental records and dataset documentation without altering the core claims or results. We will submit a revised manuscript incorporating these changes.
- Referee: §4 (Experiments and Evaluation): The automated metrics used to assess alignment with user instructions (style, rhyme, dialect) are not defined or justified. It is unclear whether custom metrics (e.g., rhyme-scheme accuracy or dialect consistency) or only generic ones (e.g., perplexity, BLEU) were applied. This detail is load-bearing because the claim of 'effective' generation rests directly on these positive metric outcomes.
Authors: We agree that the automated metrics require explicit definition and justification to support the effectiveness claim. The manuscript reports positive outcomes on metrics evaluating alignment with style, rhyme, and dialect but does not detail their implementation. In the revision, we will add a new subsection in §4 that defines each metric (e.g., rhyme-scheme accuracy via phonetic pattern matching on Arabic endings, dialect consistency via a fine-tuned classifier plus manual sampling, and style adherence via embedding cosine similarity), explains their computation, and justifies their suitability over or alongside generic metrics such as BLEU and perplexity. This will make the evaluation transparent and reproducible.
Revision: yes
- Referee: §5 (Human Evaluation): The human evaluation protocol is underspecified. No information is given on the number of native-speaker annotators, scoring rubric for poetic quality and instruction adherence, sample size of evaluated outputs, blinding procedures, or inter-annotator agreement. Without these, the reliability of the 'positive' human judgments cannot be assessed and the central effectiveness claim is weakened.
Authors: We acknowledge that the human evaluation protocol is underspecified and that these details are necessary to assess reliability. In the revised §5, we will specify the number of native-speaker annotators and their dialect expertise, provide the full scoring rubric (separate 1-5 scales for instruction adherence on style/rhyme/dialect/task criteria and for poetic quality on fluency/creativity/cultural fit), report the sample size of evaluated outputs, describe blinding (randomized, anonymized presentation), and include inter-annotator agreement statistics. These additions will strengthen the credibility of the positive judgments without changing the reported outcomes.
Revision: yes
- Referee: §3 (Dataset): While described as 'large-scale' and 'carefully curated,' the manuscript provides insufficient quantitative details on total instruction count, task/dialect distribution, and quality-control steps (e.g., verification of dialectal authenticity). These statistics are necessary to evaluate whether the dataset scale and coverage suffice to support the reported alignment results after fine-tuning.
Authors: We agree that quantitative details on dataset scale, distribution, and quality control are needed to substantiate the 'large-scale' and 'carefully curated' descriptions and to link them to the fine-tuning results. In the revision, §3 will be expanded with a summary table and text reporting the total instruction count, the breakdown by task (writing, revision, continuation, analysis) and by dialect (MSA and specific dialects), and the quality-control pipeline (including automated filters followed by native-speaker verification of dialect authenticity and task correctness). This will allow readers to evaluate coverage and sufficiency.
Revision: yes
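The rhyme-scheme accuracy mentioned in the authors' first response could look roughly like the sketch below. It is a deliberate simplification and not the authors' implementation: it assumes a monorhyme target, treats the last Arabic letter of each verse (after removing diacritics) as the rhyme letter, and ignores the finer rules of classical Arabic rhyme. The names `rhyme_letter` and `rhyme_accuracy` are hypothetical.

```python
import re

# Arabic short-vowel and tanwin marks (U+064B to U+0652) plus tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Basic Arabic letters, hamza (U+0621) through yeh (U+064A).
_ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def rhyme_letter(verse):
    """Return the last Arabic letter of a verse, ignoring diacritics."""
    stripped = _DIACRITICS.sub("", verse)
    letters = _ARABIC_LETTER.findall(stripped)
    return letters[-1] if letters else None


def rhyme_accuracy(verses, target_letter):
    """Share of verses whose final letter matches the requested rhyme letter."""
    if not verses:
        return 0.0
    hits = sum(rhyme_letter(verse) == target_letter for verse in verses)
    return hits / len(verses)


# Toy single-word "verses": two end in ب (ba), one ends in ر (ra).
verses = ["قَلْبٌ", "حَبِيب", "نُور"]
score = rhyme_accuracy(verses, "ب")
```

A letter-level match is only a proxy; a phonetic approach as described in the response would also need to handle long vowels, the vowel on the rhyme letter, and dialectal pronunciation.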
Circularity Check
No circularity in empirical dataset creation and evaluation
The paper's core contribution is the construction of a large-scale instruction-based dataset for Arabic poetry tasks (writing, revising, continuing, and analysis) followed by standard LLM fine-tuning. Effectiveness is asserted via automated metrics and human judgments from native speakers, which function as independent external benchmarks rather than quantities defined by the dataset itself. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The results are checked against external validation, and no claimed result reduces to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Instruction tuning on high-quality curated datasets produces models whose outputs align with user-specified constraints in creative text generation.