MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
Pith reviewed 2026-05-10 04:01 UTC · model grok-4.3
The pith
Multilingual LLMs show significant gaps when rewriting first-person sentences to flip grammatical gender in French, Arabic, and Hindi.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MORPHOGEN provides a morphologically grounded dataset spanning French, Arabic, and Hindi. Its core GENFORM task measures whether models can transform a first-person sentence into the opposite gender without altering meaning or structure. Benchmark results on fifteen models ranging from 2B to 70B parameters reveal consistent failures in producing correct verb conjugations, pronouns, and agreement patterns.
What carries the argument
The GENFORM task, which requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure, acts as the diagnostic test for gender-aware morphological generation.
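To make the transformation concrete, here is a minimal sketch of what a GENFORM-style item and scorer could look like. The field names and the exact-match criterion are assumptions, since the review does not specify the paper's data schema or metric; in French, flipping "Je suis allé" (masculine) to feminine requires the agreement marker on the past participle ("allée").

```python
# Hypothetical illustration of a GENFORM-style item and a naive scorer.
# The field names and the exact-match criterion are assumptions.
item = {
    "lang": "fr",
    "source_gender": "masc",
    "source": "Je suis allé au marché hier.",       # masculine speaker
    "target_gender": "fem",
    "reference": "Je suis allée au marché hier.",   # feminine agreement on the participle
}

def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after whitespace normalization."""
    return " ".join(prediction.split()) == " ".join(reference.split())

# A model that forgets the extra -e on the past participle fails the item:
print(exact_match("Je suis allé au marché hier.", item["reference"]))   # False
print(exact_match("Je suis allée au marché hier.", item["reference"]))  # True
```

A real scorer would likely need to be more forgiving than strict string equality (e.g., tolerating punctuation variants) while still penalizing missed agreement markers.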
If this is right
- Models require targeted improvements in morphological agreement rules to succeed at gender transformations.
- The benchmark supplies a repeatable diagnostic that future model releases can be measured against.
- Performance differences across languages highlight which grammatical systems current training data covers least well.
- Insights from the results can guide development of generation systems that respect explicit and implicit gender cues.
Where Pith is reading between the lines
- Extending the same transformation test to additional gendered languages could reveal whether the observed gaps are universal or language-specific.
- Integrating the GENFORM task into training loops might produce models that handle gender morphology more reliably without separate fine-tuning stages.
- Real-world applications such as machine translation or dialogue systems could adopt similar checks to reduce gender errors before deployment.
- The approach of testing first-person gender flips isolates a narrow but measurable aspect of inclusivity that broader benchmarks often overlook.
Load-bearing premise
The synthetic dataset accurately mirrors real grammatical gender agreement and first-person constructions in the three languages without introducing artificial patterns.
What would settle it
High accuracy across all fifteen models on the GENFORM task for French, Arabic, and Hindi would falsify the claimed gaps; conversely, replication of the reported failure patterns on an independently validated dataset would confirm that they reflect model limitations rather than dataset artifacts.
Original abstract
While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MORPHOGEN, a multilingual benchmark for gender-aware morphological generation in French, Arabic, and Hindi. The core GENFORM task requires LLMs to rewrite first-person sentences to the opposite gender while preserving meaning and structure. A high-quality synthetic dataset is constructed across the three languages, and 15 multilingual LLMs (2B–70B parameters) are evaluated, with results showing significant performance gaps and insights into current models' handling of morphological gender.
Significance. If the synthetic dataset is shown to faithfully encode real grammatical gender agreement and meaning-preserving rewrites, this benchmark would provide a focused diagnostic for an underexplored capability in multilingual LLMs. The multi-language, multi-model evaluation could guide future work on morphology-sensitive and inclusive NLP systems.
major comments (2)
- §3 (Dataset Construction): The headline claim of model-induced gaps on GENFORM requires that the synthetic dataset accurately reflects real first-person gender agreement patterns without artifacts (e.g., unnatural verb forms or inconsistent triggers). The section must detail the construction process, any LLM-assisted generation, the templates used, and validation steps such as native-speaker annotation or comparison to attested corpora.
- §5 (Results): The reported 'significant gaps' across the 15 LLMs are presented without statistical significance tests, confidence intervals, or error analysis broken down by language or error type. This makes it difficult to assess whether observed differences are robust or could be dataset-induced.
minor comments (2)
- Abstract: The phrase 'interesting insights' is vague; briefly enumerating one or two key observations (e.g., language-specific failure modes) would improve informativeness.
- Results tables: In Table 2 (or the equivalent results table), report per-language scores and overall averages consistently, using the same number of decimal places throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for clarity and rigor.
Point-by-point responses
-
Referee: §3 (Dataset Construction): The headline claim of model-induced gaps on GENFORM requires that the synthetic dataset accurately reflects real first-person gender agreement patterns without artifacts (e.g., unnatural verb forms or inconsistent triggers). The section must detail the construction process, any LLM-assisted generation, the templates used, and validation steps such as native-speaker annotation or comparison to attested corpora.
Authors: We agree that detailed documentation of the dataset construction is critical to support the validity of the benchmark and the observed model gaps. The current §3 provides an overview of the synthetic data generation process for first-person sentences and their gender-reversed versions across French, Arabic, and Hindi. To address this, we will substantially expand the section to include: the explicit language-specific templates used for sentence generation; full details on any LLM-assisted steps (including prompts and post-editing); results from native-speaker validation (annotations by multiple native speakers for grammatical accuracy, naturalness, and meaning preservation, with inter-annotator agreement metrics); and comparisons to attested corpus examples where available. These additions will confirm the absence of artifacts in gender agreement patterns. Revision planned: yes.
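The template-based generation the authors describe can be sketched as paired masculine/feminine surface forms per slot, so each generated pair differs only in the gender-marked material. The templates and word pairs below are illustrative assumptions, not the paper's actual resources.

```python
# Hypothetical sketch of template-based pair generation of the kind an
# expanded §3 might document. Templates and adjective pairs are assumptions.
TEMPLATES_FR = [
    # (masculine template, feminine template) sharing the same structure
    ("Je suis {m} de mon travail.", "Je suis {f} de mon travail."),
    ("Je me sens {m} aujourd'hui.", "Je me sens {f} aujourd'hui."),
]
ADJ_PAIRS_FR = [
    ("fier", "fière"),        # proud
    ("content", "contente"),  # glad
]

def generate_pairs():
    """Yield (masculine sentence, feminine sentence) pairs; each pair differs
    only in the gender-marked slot, so meaning and structure are preserved."""
    for masc_t, fem_t in TEMPLATES_FR:
        for m, f in ADJ_PAIRS_FR:
            yield masc_t.format(m=m), fem_t.format(f=f)

pairs = list(generate_pairs())  # 2 templates x 2 adjective pairs = 4 items
```

A construction like this makes the native-speaker validation step tractable: annotators check each template/word-pair combination once rather than every generated sentence.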
-
Referee: §5 (Results): The reported 'significant gaps' across the 15 LLMs are presented without statistical significance tests, confidence intervals, or error analysis broken down by language or error type. This makes it difficult to assess whether observed differences are robust or could be dataset-induced.
Authors: We acknowledge that the results section would benefit from greater statistical rigor and a finer-grained error breakdown to strengthen claims about model performance gaps. We will revise §5 to include: statistical significance tests (such as McNemar's test for paired model comparisons) with p-values adjusted for multiple comparisons; confidence intervals for all reported metrics; and a comprehensive error analysis disaggregated by language (French, Arabic, Hindi) and by error category (e.g., verb morphology errors, pronoun mismatches, semantic drift). This analysis, which we have conducted internally, shows consistent patterns that support the robustness of the gaps beyond potential dataset artifacts. Revision planned: yes.
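The exact (binomial) McNemar test the rebuttal proposes for paired model comparisons can be sketched in a few lines; the counts in the usage example are illustrative, not the paper's actual results.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts:
    b = items model A solved and model B missed,
    c = items model B solved and model A missed.
    Under H0 (equal accuracy) the discordants follow Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant items: the models are indistinguishable
    # Probability of a split at least as extreme as the observed one, doubled
    # for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# E.g., if model A beats B on 9 items and loses on 1, the gap is significant
# at the 5% level:
print(round(mcnemar_exact_p(1, 9), 4))  # 0.0215
```

Because the test conditions only on discordant items, it is well suited to per-item benchmark comparisons where most items are solved (or missed) by both models.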
Circularity Check
No circularity: purely empirical benchmark evaluation
Full rationale
The paper introduces the MORPHOGEN benchmark and the GENFORM task of rewriting first-person sentences to opposite gender while preserving meaning. It describes construction of a synthetic dataset for French, Arabic, and Hindi, then reports empirical results on 15 LLMs. There are no equations, fitted parameters, first-principles derivations, or predictions that reduce to inputs by construction. No self-citations serve as load-bearing justifications for uniqueness or ansatzes. The work is self-contained as a measurement study on the provided data; dataset quality concerns affect correctness but not circularity of any derivation chain.