pith. machine review for the scientific record.

arxiv: 2604.18914 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords gender-aware generation · morphological generation · grammatical gender · multilingual benchmark · first-person constructions · French · Arabic · Hindi · LLM evaluation

The pith

Multilingual LLMs show significant gaps when rewriting first-person sentences to flip grammatical gender in French, Arabic, and Hindi.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark called MORPHOGEN to test how current language models handle the grammatical rules of gender in languages where it affects verbs, pronouns, and sentence structure. The central evaluation task requires a model to take a first-person sentence and rewrite it in the opposite gender while keeping the exact meaning and form intact. By creating a large synthetic dataset for three typologically different languages and running fifteen popular models on it, the work demonstrates that these models frequently produce incorrect morphological forms. A reader would care because accurate gender handling matters for any application that generates natural text in these languages, from translation tools to conversational systems.

Core claim

MORPHOGEN provides a morphologically grounded dataset spanning French, Arabic, and Hindi. Its core GENFORM task measures whether models can transform a first-person sentence into the opposite gender without altering meaning or structure. Benchmark results on fifteen models ranging from 2B to 70B parameters reveal consistent failures in producing correct verb conjugations, pronouns, and agreement patterns.

What carries the argument

The GENFORM task, which requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure, acts as the diagnostic test for gender-aware morphological generation.
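The GENFORM evaluation loop reduces to rewriting and scoring. A minimal exact-match sketch, where `genform_accuracy` and `rewrite_fn` are hypothetical names (the paper's actual prompts and scoring metric may differ):

```python
# Hypothetical sketch of a GENFORM-style check: exact-match scoring of
# gender-flipped rewrites against gold references.

def genform_accuracy(examples, rewrite_fn):
    """examples: dicts with 'source', 'speaker_gender', 'gold'.
    rewrite_fn: callable (sentence, gender) -> model's rewrite."""
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        prediction = rewrite_fn(ex["source"], ex["speaker_gender"]).strip()
        correct += prediction == ex["gold"].strip()
    return correct / len(examples)

# Toy French pair: "je suis allé" (masc.) -> "je suis allée" (fem.)
toy = [{"source": "Je suis allé au marché.",
        "speaker_gender": "M",
        "gold": "Je suis allée au marché."}]
flip = lambda s, g: s.replace("allé ", "allée ")  # stand-in for an LLM call
print(genform_accuracy(toy, flip))  # 1.0
```

Exact match is the strictest plausible scorer; a real harness would likely also need normalization for whitespace and punctuation before comparison.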

If this is right

  • Models require targeted improvements in morphological agreement rules to succeed at gender transformations.
  • The benchmark supplies a repeatable diagnostic that future model releases can be measured against.
  • Performance differences across languages highlight which grammatical systems current training data covers least well.
  • Insights from the results can guide development of generation systems that respect explicit and implicit gender cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same transformation test to additional gendered languages could reveal whether the observed gaps are universal or language-specific.
  • Integrating the GENFORM task into training loops might produce models that handle gender morphology more reliably without separate fine-tuning stages.
  • Real-world applications such as machine translation or dialogue systems could adopt similar checks to reduce gender errors before deployment.
  • The approach of testing first-person gender flips isolates a narrow but measurable aspect of inclusivity that broader benchmarks often overlook.

Load-bearing premise

The synthetic dataset accurately mirrors real grammatical gender agreement and first-person constructions in the three languages without introducing artificial patterns.

What would settle it

High accuracy across all fifteen models on the GENFORM task for French, Arabic, and Hindi would show that the claimed gaps do not exist.

Figures

Figures reproduced from arXiv: 2604.18914 by Aditya Aggarwal, Anubha Gupta, Arnav Goel, Medha Hira, Mehul Agarwal.

Figure 1. Example illustrating how gender-based morphology differs across the three languages.
Figure 2. Gendered Terms Distribution in MORPHOGEN.
Figure 3. General morphological rules for grammati…
Figure 4. Distribution of Sentence Frequency Per Morphological Rule for Each Language.
Figure 5. ΔSGA (Accuracy Gap) across all models and languages (French, Arabic, Hindi) in the MORPHOGEN benchmark. Positive values indicate masculine bias; negative values indicate feminine bias.
Figure 6. Rule-based and model-wise IoU metrics across all three languages.
Figure 7. Example of results of the LLAMA family of models on multiple entities in the French dataset.
Figure 8. Example of results of the LLAMA family of models on multiple entities in the Arabic dataset.
Figure 9. Example of results of the LLAMA family of models on multiple entities in the Hindi dataset.
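Figure 5's ΔSGA metric reduces to a signed difference of direction-wise accuracies. A minimal sketch under the sign convention given in the caption (positive = masculine bias); the function name and the numbers are illustrative, not from the paper:

```python
def delta_sga(acc_to_masculine: float, acc_to_feminine: float) -> float:
    """Signed accuracy gap between rewrite directions.

    Positive values mean rewrites toward masculine forms succeed more
    often (masculine bias); negative values indicate feminine bias.
    """
    return acc_to_masculine - acc_to_feminine

# Illustrative accuracies only:
gap = delta_sga(0.82, 0.67)
print(round(gap, 2))  # 0.15 -> masculine bias
```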
read the original abstract

While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MORPHOGEN, a multilingual benchmark for gender-aware morphological generation in French, Arabic, and Hindi. The core GENFORM task requires LLMs to rewrite first-person sentences to the opposite gender while preserving meaning and structure. A high-quality synthetic dataset is constructed across the three languages, and 15 multilingual LLMs (2B–70B parameters) are evaluated, with results showing significant performance gaps and insights into current models' handling of morphological gender.

Significance. If the synthetic dataset is shown to faithfully encode real grammatical gender agreement and meaning-preserving rewrites, this benchmark would provide a focused diagnostic for an underexplored capability in multilingual LLMs. The multi-language, multi-model evaluation could guide future work on morphology-sensitive and inclusive NLP systems.

major comments (2)
  1. [§3 (Dataset Construction)] The headline claim of model-induced gaps on GENFORM requires that the synthetic dataset accurately reflect real first-person gender agreement patterns without artifacts (e.g., unnatural verb forms or inconsistent triggers). The section should detail the construction process, any LLM-assisted generation, the templates used, and validation steps such as native-speaker annotation or comparison to attested corpora.
  2. [§5 (Results)] The reported "significant gaps" across the 15 LLMs are presented without statistical significance tests, confidence intervals, or error analysis broken down by language or error type, making it difficult to assess whether the observed differences are robust or dataset-induced.
minor comments (2)
  1. [Abstract] The phrase "interesting insights" is vague; briefly enumerating one or two key observations (e.g., language-specific failure modes) would improve informativeness.
  2. [Results tables] Table 2 (or the equivalent results table): ensure consistent reporting of per-language scores and overall averages to the same number of decimal places.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements for clarity and rigor.

read point-by-point responses
  1. Referee: [§3 (Dataset Construction)] The headline claim of model-induced gaps on GENFORM requires that the synthetic dataset accurately reflects real first-person gender agreement patterns without artifacts (e.g., unnatural verb forms or inconsistent triggers). The section must detail the construction process, any LLM-assisted generation, templates used, and validation steps such as native-speaker annotation or comparison to attested corpora.

    Authors: We agree that detailed documentation of the dataset construction is critical to support the validity of the benchmark and the observed model gaps. The current §3 provides an overview of the synthetic data generation process for first-person sentences and their gender-reversed versions across French, Arabic, and Hindi. To address this, we will substantially expand the section to include: explicit language-specific templates used for sentence generation; full details on any LLM-assisted steps (including prompts and post-editing); results from native-speaker validation (we performed annotations by multiple native speakers for grammatical accuracy, naturalness, and meaning preservation, with inter-annotator agreement metrics); and comparisons to attested examples from corpora where available. These additions will confirm the absence of artifacts in gender agreement patterns. revision: yes

  2. Referee: [§5 (Results)] The reported 'significant gaps' across the 15 LLMs are presented without statistical significance tests, confidence intervals, or error analysis broken down by language or error type. This makes it difficult to assess whether observed differences are robust or could be dataset-induced.

    Authors: We acknowledge that the results section would benefit from greater statistical rigor and error breakdown to strengthen claims about model performance gaps. We will revise §5 to include: statistical significance tests (such as McNemar's test for paired model comparisons) with p-values and adjustments for multiple comparisons; confidence intervals for all reported metrics; and a comprehensive error analysis disaggregated by language (French, Arabic, Hindi) and by error categories (e.g., verb morphology errors, pronoun mismatches, semantic drift). This analysis, which we have conducted internally, shows consistent patterns that support the robustness of the gaps beyond potential dataset artifacts. revision: yes
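The McNemar's test the rebuttal proposes compares two models on the same items via their discordant pairs. A stdlib-only sketch of the exact two-sided variant (statsmodels provides an equivalent `mcnemar`); the counts below are illustrative, not results from the paper:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant counts:
    b = items only model A rewrote correctly,
    c = items only model B rewrote correctly."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Two-sided tail of Binomial(n, 0.5), capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# A large asymmetry in discordant pairs is significant; a balanced one is not.
print(mcnemar_exact(30, 2) < 0.05)  # True
print(mcnemar_exact(10, 10))        # 1.0
```

Because only discordant pairs enter the statistic, the test ignores items both models get right or wrong, which suits per-sentence accuracy comparisons on a shared benchmark.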

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper introduces the MORPHOGEN benchmark and the GENFORM task of rewriting first-person sentences to opposite gender while preserving meaning. It describes construction of a synthetic dataset for French, Arabic, and Hindi, then reports empirical results on 15 LLMs. There are no equations, fitted parameters, first-principles derivations, or predictions that reduce to inputs by construction. No self-citations serve as load-bearing justifications for uniqueness or ansatzes. The work is self-contained as a measurement study on the provided data; dataset quality concerns affect correctness but not circularity of any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the paper is an empirical benchmark study relying on standard assumptions about language morphology and LLM evaluation.

pith-pipeline@v0.9.0 · 5492 in / 961 out tokens · 30622 ms · 2026-05-10T04:01:08.686237+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 20 canonical work pages · 7 internal anchors

    "I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset , author =. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL) , year =