When Cultures Meet: Multicultural Text-to-Image Generation

Brian Trinh; Mounika Yalamarty; Oana Ignat; Parth Bhalerao

arxiv: 2502.15972 · v2 · submitted 2025-02-21 · 💻 cs.CV · cs.AI

When Cultures Meet: Multicultural Text-to-Image Generation

Parth Bhalerao , Mounika Yalamarty , Brian Trinh , Oana Ignat This is my paper

Pith reviewed 2026-05-23 01:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords text-to-image generationmulticultural scenesbenchmark datasetmulti-agent frameworkcultural personasimage qualityfairness disparitiesprompt composition

0 comments

The pith

Richer prompts composed by multiple cultural agents improve image quality and cultural accuracy in text-to-image generation for mixed scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-image models handle uniform cultural settings well but remain largely untested on scenes that combine people and landmarks from different cultures. The paper creates the first benchmark for this setting, a set of 9,000 images that vary by country, age, gender, landmark, and language, and uses it to measure how current models perform on alignment, quality, aesthetics, knowledge, and fairness. It also introduces a multi-agent method in which separate language models adopt distinct cultural personas to build more detailed prompts. This richer composition raises image quality and cultural grounding relative to simple prompts, yet the same tests uncover large performance differences across languages and demographic groups.

Core claim

The paper establishes multicultural text-to-image generation as a distinct task and demonstrates through its 9,000-image benchmark that state-of-the-art models exhibit disparities in alignment, image quality, aesthetics, knowledge, and fairness across cultures, languages, and demographics. It shows that a multi-agent framework using LLMs with distinct cultural personas can produce richer prompts that improve both image quality and cultural grounding compared with simple prompts.

What carries the argument

MosAIG, the multi-agent framework that assigns distinct cultural personas to separate LLMs so they collaborate on composing richer prompts for image generation.

If this is right

Richer prompt composition raises image quality relative to simple prompts.
Richer prompt composition raises cultural grounding relative to simple prompts.
Current models display substantial performance disparities across the five languages in the benchmark.
Current models display substantial performance disparities across the age and gender groups in the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could serve as a testbed for training data that includes more balanced cultural representation.
The multi-agent persona approach could be tested on related generation tasks such as video or 3D content that mix cultural elements.
Repeating the evaluation on additional languages and countries would show whether the observed disparities generalize or remain specific to the chosen set.

Load-bearing premise

The 9,000-image benchmark and the five chosen evaluation dimensions capture the main challenges of multicultural generation without selection bias in country, landmark, demographic, or language coverage.

What would settle it

A head-to-head comparison of simple prompts versus MosAIG-style richer prompts on a fresh collection of multicultural scenes drawn from countries and languages outside the original benchmark, scored on the same quality and grounding metrics.

Figures

Figures reproduced from arXiv: 2502.15972 by Brian Trinh, Mounika Yalamarty, Oana Ignat, Parth Bhalerao.

**Figure 2.** Figure 2: Overview of MosAIG, our framework for MultiAgent Image Generation. The framework includes a multi-agent interaction model that generates an image caption from demographic information (person age, gender, country, landmark, and caption language), which is then used by an image generation model to create a multicultural image of a landmark and a person. The Landmark Agent describes the landmark architectu… view at source ↗

**Figure 3.** Figure 3: Our multi-agent models (Alt-En-M and Flux [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation studies on (a) person age, (b) person gender, (c) person country, (d) landmark country, (e) caption language using the best overall model, the Multi-agent English Flux-M (a-d) and Multi-agent Multilingual Alt-M (e). Performance across all five metrics—Alignment, Aesthetic, Quality, Knowledge, and Fairness—reveals significant variation across these demographic categories. b) Person Gender [PITH_FU… view at source ↗

**Figure 5.** Figure 5: Alignment scores with the best overall model, [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Alignment scores with the best overall multi [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of generated images and captions using our multi-agent framework ( [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Our Multi-agent Framework Prompts Aspect Definition Alignment Is the image semantically correct given the text (text-image alignment)? Quality Do the generated images look like real photographs? Aesthetic Is the image aesthetically pleasing? Fairness Does the model exhibit performance disparities across social groups (e.g., gender, dialect) Knowledge Does the model have knowledge about the world or domains… view at source ↗

**Figure 9.** Figure 9: Human Annotation Interface for manually evaluating the images across all models. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: English vs. Multilingual Performance. Mod [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison of generated images and captions using our multi-agent framework ( [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison of generated images and captions using our multi-agent framework ( [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Comparison of generated images and captions using our multi-agent framework ( [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

read the original abstract

Text-to-image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text-to-image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state-of-the-art text-to-image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas. Our analysis shows that richer prompt composition can improve image quality and cultural grounding compared to simple prompts, while revealing substantial disparities across languages and demographic groups. We release our dataset and code at https://github.com/AIM-SCU/MosAIG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper defines multicultural text-to-image as a new task and releases a 9k-image benchmark plus a multi-agent prompting method, but the abstract gives almost no numbers or method details to judge the results.

read the letter

The core contribution is straightforward: the authors treat multicultural scenes in text-to-image models as a distinct setting, build a benchmark spanning five countries, three age groups, two genders, 25 landmarks and five languages, and test whether a multi-agent LLM setup (MosAIG) with cultural personas produces better prompts than simple ones. They also release the data and code, which is the most immediately usable part of the work. Their high-level claim that richer prompt composition improves quality and grounding while models still show clear gaps across languages and demographics is plausible and worth checking against the actual images and metrics once the full paper is read. The public resource lowers the barrier for follow-up experiments on representation issues. The main limitation visible from the abstract is the complete absence of quantitative results, tables, or descriptions of how alignment, aesthetics, knowledge, or fairness were scored. Without those, it is impossible to tell whether the reported improvements are large enough to matter or whether the chosen evaluation dimensions actually capture the hardest multicultural cases. The benchmark coverage itself could carry selection effects in the countries or landmarks picked, but that cannot be assessed yet. This is useful for people already working on bias and cultural fairness in generative models; a reader who wants a ready dataset to run their own tests or to compare prompting strategies will get concrete value. The task definition and data release are solid enough that the paper should go to peer review rather than a desk reject, even if the current version needs the results section expanded before acceptance.

Referee Report

3 major / 0 minor

Summary. The paper introduces multicultural text-to-image generation as a new task and releases the first benchmark of 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. It evaluates state-of-the-art T2I models on alignment, image quality, aesthetics, knowledge, and fairness, and proposes MosAIG, a multi-agent LLM framework that assigns distinct cultural personas to improve prompt composition. The central observational claim is that richer prompts yield better image quality and cultural grounding than simple prompts while exposing substantial disparities across languages and demographic groups; the dataset and code are released.

Significance. If the benchmark proves representative and the quantitative results hold, the work addresses a clear gap in evaluating T2I models for multicultural scenes and supplies a concrete mitigation strategy via persona-based prompting. Public release of the 9,000-image dataset and code is a concrete strength that enables reproducibility and follow-on research.

major comments (3)

[Abstract] Abstract: the claims that 'richer prompt composition can improve image quality and cultural grounding' and that 'substantial disparities' exist are presented without any quantitative results, error bars, metric definitions, or statistical tests, so it is impossible to verify that the data support the conclusions.
[Abstract] The benchmark construction (five countries, 25 landmarks, five languages) is load-bearing for the disparity claims, yet no justification is given for the selection criteria or coverage of cultural diversity; without this, selection bias cannot be ruled out.
[Abstract] The five evaluation dimensions (alignment, quality, aesthetics, knowledge, fairness) are listed but no operational definitions, annotation protocols, or inter-annotator agreement figures are supplied, undermining the ability to assess metric validity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful comments on the abstract. We address each point below and will revise the manuscript to improve clarity and support for the claims while preserving the original contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the claims that 'richer prompt composition can improve image quality and cultural grounding' and that 'substantial disparities' exist are presented without any quantitative results, error bars, metric definitions, or statistical tests, so it is impossible to verify that the data support the conclusions.

Authors: The abstract is intended as a concise summary; the full paper reports quantitative results with standard deviations, error bars, and statistical tests in Sections 4 and 5 (e.g., alignment scores, aesthetic metrics, and disparity analyses across languages and demographics). We agree the abstract would benefit from key numerical highlights and will revise it to include representative quantitative findings and metric references. revision: yes
Referee: [Abstract] The benchmark construction (five countries, 25 landmarks, five languages) is load-bearing for the disparity claims, yet no justification is given for the selection criteria or coverage of cultural diversity; without this, selection bias cannot be ruled out.

Authors: The five countries were selected to span distinct cultural regions (East Asia, South Asia, Europe, Middle East, Latin America), the 25 landmarks chosen for historical significance and visual recognizability across cultures, and languages for global speaker coverage. We acknowledge the need for explicit justification in the abstract and will add a brief statement on selection criteria and diversity rationale. revision: yes
Referee: [Abstract] The five evaluation dimensions (alignment, quality, aesthetics, knowledge, fairness) are listed but no operational definitions, annotation protocols, or inter-annotator agreement figures are supplied, undermining the ability to assess metric validity.

Authors: Detailed operational definitions, annotation protocols, and inter-annotator agreement are provided in Section 3 and the appendix. We will revise the abstract to include concise operational definitions or explicit references to these sections for improved self-containment. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an independent benchmark dataset and the MosAIG multi-agent framework as new contributions for the multicultural text-to-image task. It performs observational analysis across alignment, quality, aesthetics, knowledge, and fairness metrics on the 9,000-image set without any equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations that reduce the central claims to prior work by the same authors. The claims about richer prompt composition and disparities are presented as empirical observations from the new benchmark rather than derivations that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical benchmark creation and evaluation; no free parameters, mathematical axioms, or invented entities are introduced or required beyond standard ML evaluation assumptions.

pith-pipeline@v0.9.0 · 5716 in / 1147 out tokens · 69437 ms · 2026-05-23T01:58:05.818402+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

Unsupervised Cross-lingual Representation Learning at Scale

The Age of Migration: International Popu- lation Movements in the Modern World. Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or par- tial credit. Psychological Bulletin, 70:213–220. A Conneau. 2019. Unsupervised cross-lingual rep- resentation learning at scale. arXiv preprint arXiv:1911.02116. Javier Martín ...

work page internal anchor Pith review Pith/arXiv arXiv 1968
[2]

arXiv preprint arXiv:2403.11821

Evaluating text to image synthesis: Survey and taxonomy of image quality metrics. arXiv preprint arXiv:2403.11821. Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Pi- queras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, and Anders ...

work page arXiv 2022
[3]

In Proceedings of the IEEE/CVF International Confer- ence on Computer Vision, pages 20406–20417

Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Confer- ence on Computer Vision, pages 20406–20417. Nithish Kannen, Arif Ahmad, Marco Andreetto, Vinod- kumar Prabhakaran, Utsav Prabhu, Adji Bousso Di- eng, Pushpak Bhattacharyya, and Shachi Dave. 2024. Beyond aesth...

work page arXiv 2024
[4]

In ECAI 2024, pages 930–937

On the cultural gap in text-to-image generation. In ECAI 2024, pages 930–937. IOS Press. Rada Mihalcea, Oana Ignat, Longju Bai, Angana Borah, Luis Chiruzzo, Zhijing Jin, Claude Kwizera, Joan Nwatu, Soujanya Poria, and Thamar Solorio. 2024. Why ai is weird and should not be this way: Towards ai for everyone, with everyone, by everyone. arXiv preprint arXiv...

work page arXiv 2024
[5]

Does not match at all

work page
[6]

Has significant discrepancies

work page
[7]

Has several minor discrepancies

work page
[8]

Has a few minor discrepancies

work page
[9]

We ask the annotators to rate how pho- torealistic the generated images are

Matches exactly Quality. We ask the annotators to rate how pho- torealistic the generated images are. Determine if the following image is AI- generated or real

work page
[10]

Probably an AI-generated photo, but photore- alistic

work page
[11]

Probably a real photo, but with irregular tex- tures and shapes

work page
[12]

Real photo. Age Gender Country Landmark Child/ Adult/ Elder Female/Male Germany Cologne Cathedral Reichstag Building Neuschwanstein Castle Brandenburg Gate Holocaust Memorial India Taj Mahal Lotus Temple Gateway of India India Gate Charminar Spain Sagrada Familia Alhambra Guggenheim Museum Roman Theater of Cartagena Royal Palace of Madrid U.S. White House...

work page
[13]

I find the image ugly

work page
[14]

The image has a lot of flaws, but it’s not com- pletely unappealing

work page
[15]

I find the image neither ugly nor aesthetically pleasing

work page
[16]

The image is aesthetically pleasing and is nice to look at

work page
[17]

Summarizer

The image is aesthetically stunning. I can look at it all day. E Results E.1 Across Metrics and Demographics, across All Models PromptAgent RoleConv. Round Moderator <image> SYSTEM: You are a {moderator.role}, who is tasked to generate questions based on an image. USER: Given the image, first, try to find as much as different objects in the image as you c...

work page

[1] [1]

Unsupervised Cross-lingual Representation Learning at Scale

The Age of Migration: International Popu- lation Movements in the Modern World. Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or par- tial credit. Psychological Bulletin, 70:213–220. A Conneau. 2019. Unsupervised cross-lingual rep- resentation learning at scale. arXiv preprint arXiv:1911.02116. Javier Martín ...

work page internal anchor Pith review Pith/arXiv arXiv 1968

[2] [2]

arXiv preprint arXiv:2403.11821

Evaluating text to image synthesis: Survey and taxonomy of image quality metrics. arXiv preprint arXiv:2403.11821. Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Pi- queras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, and Anders ...

work page arXiv 2022

[3] [3]

In Proceedings of the IEEE/CVF International Confer- ence on Computer Vision, pages 20406–20417

Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Confer- ence on Computer Vision, pages 20406–20417. Nithish Kannen, Arif Ahmad, Marco Andreetto, Vinod- kumar Prabhakaran, Utsav Prabhu, Adji Bousso Di- eng, Pushpak Bhattacharyya, and Shachi Dave. 2024. Beyond aesth...

work page arXiv 2024

[4] [4]

In ECAI 2024, pages 930–937

On the cultural gap in text-to-image generation. In ECAI 2024, pages 930–937. IOS Press. Rada Mihalcea, Oana Ignat, Longju Bai, Angana Borah, Luis Chiruzzo, Zhijing Jin, Claude Kwizera, Joan Nwatu, Soujanya Poria, and Thamar Solorio. 2024. Why ai is weird and should not be this way: Towards ai for everyone, with everyone, by everyone. arXiv preprint arXiv...

work page arXiv 2024

[5] [5]

Does not match at all

work page

[6] [6]

Has significant discrepancies

work page

[7] [7]

Has several minor discrepancies

work page

[8] [8]

Has a few minor discrepancies

work page

[9] [9]

We ask the annotators to rate how pho- torealistic the generated images are

Matches exactly Quality. We ask the annotators to rate how pho- torealistic the generated images are. Determine if the following image is AI- generated or real

work page

[10] [10]

Probably an AI-generated photo, but photore- alistic

work page

[11] [11]

Probably a real photo, but with irregular tex- tures and shapes

work page

[12] [12]

Real photo. Age Gender Country Landmark Child/ Adult/ Elder Female/Male Germany Cologne Cathedral Reichstag Building Neuschwanstein Castle Brandenburg Gate Holocaust Memorial India Taj Mahal Lotus Temple Gateway of India India Gate Charminar Spain Sagrada Familia Alhambra Guggenheim Museum Roman Theater of Cartagena Royal Palace of Madrid U.S. White House...

work page

[13] [13]

I find the image ugly

work page

[14] [14]

The image has a lot of flaws, but it’s not com- pletely unappealing

work page

[15] [15]

I find the image neither ugly nor aesthetically pleasing

work page

[16] [16]

The image is aesthetically pleasing and is nice to look at

work page

[17] [17]

Summarizer

The image is aesthetically stunning. I can look at it all day. E Results E.1 Across Metrics and Demographics, across All Models PromptAgent RoleConv. Round Moderator <image> SYSTEM: You are a {moderator.role}, who is tasked to generate questions based on an image. USER: Given the image, first, try to find as much as different objects in the image as you c...

work page