arxiv: 2604.02406 · v2 · pith:7PEN3JAEnew · submitted 2026-04-02 · 💻 cs.CY

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

Pith reviewed 2026-05-21 10:31 UTC · model grok-4.3

classification 💻 cs.CY
keywords: cultural appropriateness · AI-generated images · community rubrics · text-to-image models · cultural artifacts · evaluation measures

The pith

Community input systematizes cultural appropriateness to create rubrics for evaluating AI images of artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper explores involving communities in the first stage of measuring cultural appropriateness in text-to-image AI outputs. Through case studies with blind and low-vision people in the UK plus residents of Kerala and Tamil Nadu, it turns lived experiences into precise definitions of how artifacts should appear. These definitions support rubrics that can be applied automatically across models while retaining community expertise. The work tests whether concentrating engagement early resolves the tension between repeatable measurement and capturing real perspectives.

Core claim

Systematized concepts of cultural appropriateness developed with community members reflect their lived experiences with each artifact and their preferences for depictions of material culture, showing that community involvement at the definition stage produces valid measures for AI evaluation.

What carries the argument

The staged measurement process that places community systematization of cultural appropriateness before operationalization into rubrics and automated application.
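
One way to picture this staged process is as a small data pipeline. The Python sketch below is illustrative only: the `Criterion` and `Rubric` classes, the criterion identifiers, the question wording, and the `keyword_judge` stand-in are all assumptions made for this example, not the paper's instrument. A real application stage would replace the stand-in with a call to a multimodal LLM judge.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Criterion:
    cid: str       # hypothetical identifier
    theme: str     # theme the criterion is grouped under
    question: str  # yes/no question about a visible feature


@dataclass
class Rubric:
    artifact: str
    criteria: List[Criterion] = field(default_factory=list)


# A judge answers one yes/no question about one image.
Judge = Callable[[str, str], bool]


def apply_rubric(rubric: Rubric, image_ref: str, judge: Judge) -> Dict[str, bool]:
    """Application stage: collect the judge's answer for every criterion."""
    return {c.cid: judge(image_ref, c.question) for c in rubric.criteria}


# Systematization and operationalization produce the rubric. The wording below
# paraphrases community feedback reported for the Kasavu saree; it is NOT the
# paper's actual rubric text.
saree_rubric = Rubric(
    artifact="Kasavu saree",
    criteria=[
        Criterion("C1", "Theme 1", "Does the saree appear to be cotton rather than silk?"),
        Criterion("C2", "Theme 1", "Is the saree free of additional unnecessary embellishment?"),
    ],
)


def keyword_judge(image_ref: str, question: str) -> bool:
    """Placeholder judge (assumption): answers from a hand-written note, not pixels."""
    notes = {"img_001.png": "plain cotton saree"}
    note = notes.get(image_ref, "")
    if "cotton" in question:
        return "cotton" in note
    return "plain" in note


scores = apply_rubric(saree_rubric, "img_001.png", keyword_judge)
print(scores)
```

Keeping the judge behind a plain callable means the same rubric can be applied by human annotators or by different models, which is the repeatability the staged process is meant to provide.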

Load-bearing premise

Perspectives from community engagement in the initial definition stage remain effective when converted into standardized rubrics for automatic use across many images and models.

What would settle it

Having community members score the same AI-generated images with the new rubrics, and comparing those scores with their independent judgments of cultural appropriateness, would show whether the rubrics preserve their views.
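
Such a check reduces to comparing two sets of labels per image. The Python sketch below assumes a hypothetical setup in which each image has a rubric-derived verdict and an independent holistic judgment from community members, both recorded as booleans. The function name and the data are invented for illustration.

```python
from typing import Dict, List, Tuple


def compare_verdicts(
    rubric_verdicts: Dict[str, bool],
    holistic_judgments: Dict[str, bool],
) -> Tuple[float, List[str]]:
    """Return the agreement rate and the images where the two labels differ.

    Only images present in both dictionaries are compared.
    """
    shared = sorted(set(rubric_verdicts) & set(holistic_judgments))
    if not shared:
        raise ValueError("no images were labelled in both sets")
    disagreements = [
        img for img in shared if rubric_verdicts[img] != holistic_judgments[img]
    ]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Invented example data: True means "appropriate to show".
rubric_verdicts = {"img_a": True, "img_b": False, "img_c": True, "img_d": False}
holistic_judgments = {"img_a": True, "img_b": False, "img_c": False, "img_d": False}

rate, disagreements = compare_verdicts(rubric_verdicts, holistic_judgments)
print(rate, disagreements)
```

The list of disagreeing images is arguably the more useful output, since it points to cases where a rubric may have lost nuance on the way from community input to scoring criteria.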

Figures

Figures reproduced from arXiv: 2604.02406 by Anja Thieme, Cecily Morrison, Daniela Massiceti, Deepthi Sudharsan, Hamna, Hoda Heidari, Jennifer Wortman Vaughan, Nari Johnson, Samantha Dalal, Theo Holroyd.

Figure 1: Scaffolding community engagement to develop community-centered measures of cultural representation. Given an input prompt (e.g., “a photo of a guide cane”), we invited community members to participate in designing a rubric that captures their expertise and preferences for each cultural artifact (systematization). Our research team then explored the use of this rubric within an automated multimodal LLM-as-a…
Figure 2: Measurement framework from the social sciences [1, 116]. Our research studies how to center community knowledge in the systematization process before operationalizing the systematized concept as an automated MLLM-as-a-judge system. Applying Measurement Theory to AI Evaluations. Recent work by Wallach et al. [116] advocates for researchers to rethink how they evaluate generative AI systems, drawing on the t…
Figure 3: Selected culturally significant artifacts. From right to left: (1) With the blind and low vision community, we selected a guide cane (a mobility aid that is held diagonally across one’s body) and a braille notetaker (an electronic device that can be used to read and write notes in tactile braille). (2) With residents of Tamil Nadu, we selected Pallanguzhi (a two-player mancala game where players compete to…
Figure 4: A rubric to score images of a guide cane, designed with BLV community members. Criteria that correspond to visual features in images are organized under two themes that describe participants’ desires for cultural representation. model for its demonstrated performance on MLLM-as-a-judge tasks [130] and report all results averaged over five random seeds. We provide additional details in Appendix B.3.3. To va…
Figure 5: Human-MLLM judge alignment for individual rubric criteria. A histogram that shows the human-MLLM agreement rate for individual rubric criteria. We find that there is high variance across criteria in the MLLM’s ability to annotate a criterion accurately, such as the example criterion on the left, where GPT-4o has low accuracy (agreement rate 0.46) at annotating whether a drum’s head is made of the correct …
Figure 6: Community-elicited rubrics differ meaningfully from those generated by off-the-shelf LLMs. Our rubrics differ from LLM-generated rubrics in three ways, each illustrated using an example (Appendix B.1.3). First, LLM-generated rubrics can include factual or interpretive errors that reflect misunderstandings of the artifact (e.g., whether a braille notetaker should have a screen). Second, our rubrics provide …
Figure 7: Annotated LLM-generated rubric for a guide cane. While generally providing an accurate description of a guide cane, the rubric misses several key details. The rubric does not provide a complete description of the straight handle shape of a cane (C2), a feature that is of critical importance to the community. In workshops, we learned that a band of red tape on a cane’s body is often a visual signifier that …
Figure 8: Annotated LLM-generated rubric for a braille notetaker. The rubric criteria include inaccurate descriptions of a braille notetaking device and omit descriptive details about valid depictions of braille. A braille notetaker does not resemble a notepad (e.g., it does not include a writing device such as a pen), and instead resembles a slim rectangular box (C1). The rubric does not provide a d…
Figure 9: Annotated LLM-generated rubric for Pallanguzhi. The rubric generally provides an accurate description of the most important characteristics of a Pallanguzhi board, with two differences from the community-elicited rubric. (C1) Community members clarified that the color of the wood is important, and that Pallanguzhi boards are traditionally made of a deep-brown teakwood. (C2) The number of pits in each r…
Figure 10: Annotated LLM-generated rubric for a Mridangam. The rubric lacks many of the critical details that distinguish the Mridangam from related drums and percussion instruments. One significant omission is the black circular membrane that must be present on both drumheads, a key feature that contributes to the timbre of the drum (C5). One drumhead is often slightly larger than the other (C2). The Mridangam shou…
Figure 11: Annotated LLM-generated rubric for a Kasavu saree. The rubric generally provides an accurate description of a Kasavu saree, demonstrating substantial overlap with the community-elicited rubric. However, community members were clear that the material must be cotton and not silk (C4). C1: The depiction shows a long and narrow wooden boat traditionally used in Kerala, India. C2: The image includes details su…
Figure 12: Annotated LLM-generated rubric for Chundan Vallam. The rubric criteria cover the general structure of the Chundan Vallam but do not specify its defining features. In particular, they omit details about the oar structure and handling (C2); community members specified that the oars should be long, angled downward toward the water, and that each oarsman must use a single oar. The rubrics also do not specify …
Figure 13: Criterion-level annotations provided by humans reveal the specific representational errors that make depictions of a braille notetaker inappropriate. The figure displays a reference photo of a braille notetaker, and example AI-generated images that fall into one of four groups (as annotated by humans): (1) images that are appropriate to show (and no filter-out criteria are met), (2) images that do not mee…
Figure 14: Comparing (manual) rubric application across models for a braille notetaker. The frequency at which different criteria are violated (reported here using annotations provided by humans) varies across different models. For example, the GPT Image-1 images of braille notetakers that are inappropriate to show all violate Theme 1, Criteria 4 (failing to depict valid braille). In contrast, images generated b…
Figure 15: Comparing (manual) rubric application across models for a Mridangam. Comparing the frequency at which different criteria are violated across models allows practitioners to draw interpretable insights about models’ failure modes. With the exception of GPT Image-1, many of the models (i.e., DALL·E 3, Flux.1 DEV, and Stable Diffusion 3 Medium) consistently fail to meet several criteria, such as failing to de…
Figure 16: Comparing (manual) rubric application across models for a Kasavu Saree. Visualizing the breakdown of criteria that are violated by different image generation models reveals interpretable insights about model behavior. For example, only Flux.1 and the Stable Diffusion models depict the saree with additional unnecessary embellishment (Theme 1, Criteria 4). The GPT Image-1 images consistently depict the sare…
Figure 17: Stable Diffusion 3 generates depictions of unrelated cultural artifacts and scenes when given simple transliterated prompts. Depictions improve when images are generated using DALL·E 3 revised prompts instead. The images on the left were generated with the simple transliterated prompt “A photo of a Chundan Vallam”. Instead of producing depictions of a boat, the generated images show unrelated depictions o…
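
The analyses summarized in Figures 5 and 14 to 16 rest on two simple tallies: how often a judge's annotation matches the human annotation for each criterion, and how often each model's images violate each criterion. A minimal Python sketch of both follows. The record layout, function names, model labels, and values are assumptions for illustration and do not reproduce the paper's data or code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (image_id, criterion_id) -> True if the criterion was annotated as met
Annotations = Dict[Tuple[str, str], bool]


def agreement_by_criterion(human: Annotations, judge: Annotations) -> Dict[str, float]:
    """Fraction of shared (image, criterion) pairs where the judge matches the human."""
    tally = defaultdict(lambda: [0, 0])  # criterion -> [matches, total]
    for (image_id, cid), human_label in human.items():
        if (image_id, cid) in judge:
            tally[cid][0] += int(judge[(image_id, cid)] == human_label)
            tally[cid][1] += 1
    return {cid: m / n for cid, (m, n) in tally.items()}


def violation_rates(
    records: Iterable[Tuple[str, str, bool]],
) -> Dict[str, Dict[str, float]]:
    """records: (model, criterion_id, violated). Returns model -> criterion -> rate."""
    tally = defaultdict(lambda: [0, 0])  # (model, criterion) -> [violations, total]
    for model, cid, violated in records:
        tally[(model, cid)][0] += int(violated)
        tally[(model, cid)][1] += 1
    rates: Dict[str, Dict[str, float]] = {}
    for (model, cid), (v, n) in tally.items():
        rates.setdefault(model, {})[cid] = v / n
    return rates


# Invented annotations for two images and two criteria.
human = {("i1", "C1"): True, ("i2", "C1"): False, ("i1", "C2"): True, ("i2", "C2"): True}
judge = {("i1", "C1"): True, ("i2", "C1"): True, ("i1", "C2"): True, ("i2", "C2"): True}
agreement = agreement_by_criterion(human, judge)

# Invented violation records for two hypothetical models.
records = [
    ("model_a", "C1", True), ("model_a", "C1", False),
    ("model_b", "C1", False), ("model_b", "C1", False),
]
rates = violation_rates(records)
print(agreement, rates)
```

Reporting both tallies per criterion, rather than as a single aggregate score, is what lets a reader see which criteria a judge annotates unreliably and which failure modes are specific to a model.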
Abstract

Measurement is essential to improving AI performance and mitigating harms for marginalized groups. As generative AI systems are rapidly deployed across geographies and contexts, AI measurement practices must be designed to support repeatable, automatable application across different models, datasets, and evaluation settings. But the drive to automate measurement can be in tension with the ability for measurement instruments to capture the expertise and perspectives of communities impacted by AI. Recent work advocates for breaking measurement into several key stages: first moving from an abstract concept to be measured into a precise, "systematized" concept; next operationalizing the systematized concept into a concrete measurement instrument; and finally applying the measurement instrument on data to produce measurements. This opens up an opportunity to concentrate community engagement in the systematization phase before operationalizing and applying measurement instruments. In this paper, we explore how to involve communities in systematizing the concept of "cultural appropriateness" in text-to-image models' representation of culturally significant artifacts through case studies with three communities: blind and low vision individuals residing in the UK, residents of Kerala, and residents of Tamil Nadu. Our systematized concepts reflect community members' lived experiences interacting with each artifact and how they want their material culture to be depicted, demonstrating the value of community involvement in defining valid measures. We explore how these systematized concepts can be operationalized into automated measurement instruments that could be applied using a multimodal LLM-as-a-judge approach and challenges that remain. We reflect on the benefits and limitations of such approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript explores involving communities in the systematization phase of measuring 'cultural appropriateness' for text-to-image models' depictions of cultural artifacts. Through case studies with blind and low vision individuals in the UK, residents of Kerala, and residents of Tamil Nadu, the authors develop systematized concepts drawn from participants' lived experiences and preferences for how their material culture should be represented. The work then examines operationalizing these concepts into automated instruments via a multimodal LLM-as-a-judge approach and reflects on benefits, limitations, and remaining challenges in achieving repeatable, automatable measurement across models and settings.

Significance. If the community-informed systematized concepts can be faithfully translated into LLM-usable rubrics without substantial loss of nuance or introduction of model-specific biases, the approach would meaningfully advance inclusive AI evaluation practices by reconciling automation needs with community expertise. The multi-community case studies provide concrete grounding for the claim that concentrating engagement in the systematization stage adds validity, and the explicit discussion of operationalization challenges is a constructive contribution to the broader measurement literature.

major comments (2)
  1. §4 (Operationalization and LLM-as-a-judge): The manuscript notes challenges in translating community criteria into automatable prompts or rubrics but supplies only high-level discussion rather than concrete examples of rubric items derived from specific community input (e.g., desired depictions for Kerala or Tamil Nadu artifacts) and their encoding as LLM scoring criteria. This step is load-bearing for the central claim that the approach enables repeatable, automatable instruments across settings without losing captured expertise.
  2. §3 (Case studies): The systematized concepts are presented as reflecting lived experiences, yet the text provides limited direct evidence (such as participant quotes, raw response summaries, or side-by-side comparisons of community input versus final systematized statements) to allow readers to evaluate the fidelity of the translation process.
minor comments (2)
  1. The abstract and introduction could more explicitly distinguish the three communities' distinct artifact types and cultural contexts to help readers track how findings generalize.
  2. Notation for the measurement stages (systematization, operationalization, application) is introduced clearly but could be reinforced with a small diagram or table summarizing the pipeline for each case study.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where additional detail would strengthen the manuscript's demonstration of the proposed approach. We respond to each major comment below, indicating planned revisions.

  1. Referee: §4 (Operationalization and LLM-as-a-judge): The manuscript notes challenges in translating community criteria into automatable prompts or rubrics but supplies only high-level discussion rather than concrete examples of rubric items derived from specific community input (e.g., desired depictions for Kerala or Tamil Nadu artifacts) and their encoding as LLM scoring criteria. This step is load-bearing for the central claim that the approach enables repeatable, automatable instruments across settings without losing captured expertise.

    Authors: We agree that concrete examples are necessary to support the claim of faithful translation into automatable instruments. In the revised manuscript we will add specific rubric items drawn from the Kerala and Tamil Nadu case studies, including examples of desired depictions (such as the cotton material of a Kasavu saree or the black circular membrane on both drumheads of a Mridangam) and their direct encoding as LLM scoring criteria with sample prompts and scales. This will be presented alongside discussion of remaining challenges to avoid overstating generalizability. revision: yes

  2. Referee: §3 (Case studies): The systematized concepts are presented as reflecting lived experiences, yet the text provides limited direct evidence (such as participant quotes, raw response summaries, or side-by-side comparisons of community input versus final systematized statements) to allow readers to evaluate the fidelity of the translation process.

    Authors: We acknowledge the value of greater transparency in showing the translation process. The revised §3 will incorporate selected participant quotes, summarized raw responses, and side-by-side comparisons between community inputs and the final systematized statements for the three case studies, enabling readers to assess fidelity more directly while respecting participant confidentiality constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper is a qualitative exploration of community involvement in systematizing concepts of cultural appropriateness for AI image generation via case studies. It contains no equations, fitted parameters, predictions, or self-referential derivations that reduce claims to author-defined inputs by construction. The central claims rest on direct community input rather than any load-bearing self-citation chain or renaming of prior results. This is the most common honest non-finding for self-contained qualitative work against external community benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption that community perspectives captured through engagement constitute valid and superior input for defining cultural appropriateness; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Community members' lived experiences provide the authoritative basis for defining valid measures of cultural appropriateness in AI image generation.
    Invoked when the abstract states that systematized concepts reflect community experiences and demonstrate the value of involvement.

pith-pipeline@v0.9.0 · 5834 in / 1205 out tokens · 46919 ms · 2026-05-21T10:31:43.559900+00:00 · methodology


Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · 5 internal anchors

  1. [1]

    Robert Adcock and David Collier. 2001. Measurement validity: A shared standard for qualitative and quantitative research. American Political Science Review 95, 3 (2001), 529–546

  2. [3]

    Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Ashutosh Dwivedi, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards Measuring and Modeling “Culture” in LLMs: A Survey. arXiv preprint arXiv:2403.15412 (2024)

  3. [4]

    Rudaiba Adnin and Maitraye Das. 2024. "I look at it as the king of knowledge": How Blind People Use and Understand Generative AI Tools. In Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility (St. John’s, NL, Canada) (ASSETS ’24). Association for Computing Machinery, New York, NY, USA, Article 64, 14 pages. doi:10.11...

  4. [5]

    Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu...

  5. [6]

    Sherry R. Arnstein. 1969. A Ladder of Citizen Participation. Journal of the American Institute of Planners 35, 4 (1969), 216–224

  6. [7]

    Lora Aroyo, Alex S. Taylor, Mark Díaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-García, Vinodkumar Prabhakaran, and Ding Wang

  7. [8]

    DICES dataset: diversity in conversational AI evaluation for safety. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2321, 13 pages

  8. [9]

    Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, W Duncan Wadsworth, and Hanna Wallach. 2021. Designing disaggregated evaluations of ai systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 368–378

  9. [10]

    Cynthia L. Bennett, Erin Brady, and Stacy M. Branham. 2018. Interdependence as a Frame for Assistive Technology Research and Design. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (Galway, Ireland) (ASSETS ’18). Association for Computing Machinery, New York, NY, USA, 161–173. doi:10.1145/3234695.3236348

  10. [11]

    Cynthia L. Bennett, Shaun K. Kane, and Christina N. Harrington. 2025. Toward Community-Led Evaluations of Text-to-Image AI Representations of Disability, Health, and Accessibility. In Proceedings of the 5th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO ’25). Association for Computing Machinery, New York, NY, USA, 25...

  11. [12]

    Stevie Bergman, Nahema Marchal, John Mellor, Shakir Mohamed, Iason Gabriel, and William Isaac. 2024. STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports 14, 1 (2024), 6616

  12. [13]

    Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. 2023. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (Chicago, IL, U...

  13. [14]

    Asia Biega, Georgina Born, Fernando Diaz, Mary L. Gray, and Rida Qadri. 2025. Towards a Multidisciplinary Vision for Culturally Inclusive Generative AI (Dagstuhl Seminar 25022). Dagstuhl Reports 15, 1 (2025), 33–49. doi:10.4230/DagRep.15.1.33

  14. [15]

    Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux

  15. [16]

    Janet Blake. 2000. On Defining the Cultural Heritage. International & Comparative Law Quarterly 49, 1 (2000), 61–85

  16. [17]

    Emory S. Bogardus. 1942. Fundamentals of Social Psychology (3 ed.). D. Appleton-Century Company, New York and London

  17. [18]

    Stacy M. Branham and Shaun K. Kane. 2015. The Invisible Work of Accessibility: How Blind Employees Manage Accessibility in Mixed-Ability Workplaces. In Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility (Lisbon, Portugal) (ASSETS ’15). Association for Computing Machinery, New York, NY, USA, 163–171. doi:10.1145/270064...

  18. [19]

    Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Philipp Koehn and Rada Mihalcea (Eds.). Association for Computational Linguistics, Singapore, 286–295. https://aclanthology.org/D09-1030/

  19. [20]

    Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 2334–2346. doi:10.1145/3025453.3026044

  20. [21]

    Kyla Chasalow and Karen Levy. 2021. Representativeness in statistics, politics, and machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 77–89

  21. [22]

    Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, and Golnoosh Farnadi. 2025. Neither Valid nor Reliable? Investigating the Use of LLMs as Judges. https://arxiv.org/abs/2508.18076

  22. [23]

    Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat, Koustuv Sinha, Melissa Hall, Michal Drozdzal, and Adriana Romero-Soriano. 2025. Multi-Modal Language Models as Text-to-Image Model Evaluators. https://arxiv.org/abs/2505.00759

  23. [24]

    Tim Connell. 2008. The Challenge of Assistive Technology and Braille Literacy. https://www.afb.org/aw/9/1/14277 [Online; accessed 6-September-2025]

  24. [25]

    Emily Corvi, Hannah Washington, Stefanie Reed, Chad Atalla, Alexandra Chouldechova, P. Alex Dow, Jean Garcia-Gathright, Nicholas J Pangakis, Emily Sheng, Dan Vann, Matthew Vogel, and Hanna Wallach. 2025. Taxonomizing Representational Harms using Speech Act Theory. In Findings of the Association for Computational Linguistics. doi:10.18653/v1/2025.findings-acl.202

  25. [26]

    Amanda Coston, Anna Kawakami, Haiyi Zhu, Ken Holstein, and Hoda Heidari. 2023. A validity perspective on evaluating the justified use of data-driven decision-making algorithms. In 2023 IEEE conference on secure and trustworthy machine learning (SaTML). IEEE, 690–704

  26. [27]

    Lee J Cronbach and Paul E Meehl. 1955. Construct validity in psychological tests. Psychological bulletin 52, 4 (1955), 281

  27. [28]

    Samantha Dalal, Siobhan Mackenzie Hall, and Nari Johnson. 2024. Provocation: Who benefits from "inclusion" in Generative AI? https://arxiv.org/abs/2411.09102

  28. [29]

    Maitraye Das, Alexander J Fiannaca, Meredith Ringel Morris, Shaun K Kane, and Cynthia L Bennett. 2024. From provenance to aberrations: Image creator and screen reader user perspectives on alt text for AI-generated images. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–21

  29. [30]

    Maitraye Das, Darren Gergle, and Anne Marie Piper. 2019. "It doesn’t win you friends": Understanding Accessibility in Collaborative Writing for People with Vision Impairments. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 191 (Nov. 2019), 26 pages. doi:10.1145/3359293

  30. [31]

    Nassim Dehouche and Kullathida Dehouche. 2023. What’s in a text-to-image prompt? The potential of stable diffusion in visual arts education. Heliyon 9, 6 (2023), e16757. doi:10.1016/j.heliyon.2023.e16757

  31. [32]

    Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. 2023. The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice. https://arxiv.org/abs/2310.00907

  32. [33]

    Sunipa Dev, Vinodkumar Prabhakaran, Rutledge Chin Feman, Aida Davani, Remi Denton, Charu Kalia, Piyawat L Kumjorn, Madhurima Maji, Rida Qadri, Negar Rostamzadeh, Renee Shelby, Romina Stella, Hayk Stepanyan, Erin van Liemt, Aishwarya Verma, Oscar Wahltinez, Edem Wornyo, Andrew Zaldivar, and Saška Mojsilović. 2026. A Unified Framework to Quantify Cultural I...

  33. [34]

    Athiya Deviyani and Fernando Diaz. 2025. Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy. https://arxiv.org/abs/2503.19828

  34. [35]

    Lisa Egede. 2025. Exploring Black Communities’ Perceptions and Design Approaches for Building Culturally Tailored AI Systems. Association for Computing Machinery, New York, NY, USA, 72–76. https://doi.org/10.1145/3715668.3735629

  35. [36]

    Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. 2025. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. https://arxiv.org/abs/2502.06559

  36. [37]

    Yannick Exner, Jochen Hartmann, Oded Netzer, and Shunyuan Zhang. 2025. AI in Disguise - How AI-Generated Ads’ Visual Cues Shape Consumer Perception and Performance. doi:10.2139/ssrn.5096969

  37. [38]

    Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 1778–1785. doi:10.1109/CVPR.2009.5206772

  38. [39]

    Sanjana Gautam, Pranav Narayanan Venkit, and Sourojit Ghosh. 2024. From melting pots to misrepresentations: Exploring harms in Generative AI. arXiv preprint arXiv:2403.10776 (2024)

  39. [40]

    Simret Araya Gebreegziabher, Charles Chiang, Zichu Wang, Zahra Ashktorab, Michelle Brachman, Werner Geyer, Toby Jia-Jun Li, and Diego Gómez-Zará. 2025. MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflow. In Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work (CHIWORK ’25). Association f...

  40. [41]

    Sourojit Ghosh, Pranav Narayanan Venkit, Sanjana Gautam, Shomir Wilson, and Aylin Caliskan. 2024. Do Generative AI Models Output Harm while Representing Non-Western Cultures: Evidence from A Community-Centered Approach. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7, 1 (Oct. 2024), 476–489. doi:10.1609/aies.v7i1.31651

  41. [42]

    Tarleton Gillespie. 2024. Generative AI and the politics of visibility. Big Data & Society 11, 2 (2024), 20539517241252131. doi:10.1177/20539517241252131

  42. [43]

    Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, and Alexandra Chouldechova. 2025. Validating LLM-as-a-Judge Systems under Rating Indeterminacy. https://arxiv.org/abs/2503.05965

  43. [44]

    Kanika Gupta, Monojit Choudhury, and Kalika Bali. 2012. Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odij...

  44. [45]

    Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2024. Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? arXiv preprint arXiv:2309.07462 (2024)

  45. [46]

    Melissa Hall, Samuel J. Bell, Candace Ross, Adina Williams, Michal Drozdzal, and Adriana Romero Soriano. 2024. Towards Geographic Inclusion in the Evaluation of Text-to-Image Models. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (Rio de Janeiro, Brazil) (FAccT ’24). Association for Computing Machinery, New York, NY, ...

  46. [47]

    Stuart Hall (Ed.). 1997. Representation: Cultural Representations and Signifying Practices. Sage Publications, London.

  47. [48]

    Siobhan Mackenzie Hall, Samantha Dalal, Raesetje Sefala, Foutse Yuehgoh, Aisha Alaagib, Imane Hamzaoui, Shu Ishida, Jabez Magomere, Lauren Crais, Aya Salama, et al. 2025. The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes. arXiv preprint arXiv:2502.05961 (2025)

  48. [49]

    Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal Lalani, Evan Hadfield, Divya Siddarth, Kalika Bali, and Sunayana Sitaram. 2025. Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings. https://arxiv.org/abs/2509.24506

  49. [50]

    Hamna, Deepthi Sudharsan, Agrima Seth, Ritvik Budhiraja, Deepika Khullar, Vyshak Jain, Kalika Bali, Aditya Vashistha, and Sameer Segal. 2025. Kahani: Culturally-Nuanced Visual Storytelling Tool for Non-Western Cultures. In Proceedings of the 2025 ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS ’25). Association for Computing Ma...

  50. [51]

    Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, and Hanna Wallach. 2025. Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems. https://arxiv.org/abs/2506.04482

  51. [52]

    Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. 2024. LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.)...

  52. [53]

    Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. 2025. Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  53. [54]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2018. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. https://arxiv.org/abs/1706.08500

  54. [55]

    Rachel Hong, William Agnew, Tadayoshi Kohno, and Jamie Morgenstern. 2024. Who’s in and who’s out? A case study of multimodal CLIP-filtering in DataComp. In Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. 1–17

  55. [56]

    Chien-Chi Hsu and Brian A. Sandford. 2007. The Delphi technique: Making sense of consensus. Practical Assessment, Research, and Evaluation 12, 10 (2007), 1–8. https://openpublishing.library.umass.edu/pare/article/id/1418/ A widely cited methodological overview of the Delphi method.

  56. [57]

    Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. 2023. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. https://arxiv.org/abs/2303.11897

  57. [58]

    Mina Huh, Yi-Hao Peng, and Amy Pavel. 2023. GenAssist: Making image generation accessible. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–17

  58. [59]

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. 2024. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. https://arxiv.org/abs/2401.09603

  59. [60]

    Akshita Jha, Vinodkumar Prabhakaran, Remi Denton, Sarah Laszlo, Shachi Dave, Rida Qadri, Chandan K Reddy, and Sunipa Dev. 2024. Visage: A global-scale analysis of visual stereotypes in text-to-image generation. arXiv preprint arXiv:2401.06310 (2024)

  60. [61]

    Harry H. Jiang, Lauren Brown, Jessica Cheng, Mehtab Khan, Abhishek Gupta, Deja Workman, Alex Hanna, Johnathan Flowers, and Timnit Gebru. 2023. AI Art and its Impact on Artists. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (Montréal, QC, Canada) (AIES ’23). Association for Computing Machinery, New York, NY, USA, 363–374. doi:10.1145/3600211.3604681

  62. [63]

    Nari Johnson, Hamna Abid, Deepthi Sudharsan, Theo Holroyd, Samantha Dalal, Siobhan Mackenzie Hall, Jennifer Wortman Vaughan, Daniela Massiceti, and Cecily Morrison. 2025. Position: To Make Text-to-Image Models that Work for Marginalized Communities, We Need New Measurement Practices for the Long Tail. https://www.microsoft.com/en-us/research/publication/p...

  63. [64]

    Shivani Kapania, Stephanie Ballard, Alex Kessler, and Jennifer Wortman Vaughan. 2025. Examining the Expanding Role of Synthetic Data Throughout the AI Development Pipeline. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency

  64. [65]

    Anna Kawakami, Su Lin Blodgett, Solon Barocas, Alex Chouldechova, Abigail Jacobs, Emily Sheng, Jenn Wortman Vaughan, Hanna Wallach, Amy Winecoff, Angelina Wang, Haiyi Zhu, and Ken Holstein. 2025. Translation Tutorial: AI Measurement as a Stakeholder-Engaged Design Practice. Retrieved January 10, 2026 from https://drive.google.com/file/d/12qQd6ROfacYAtoQ-ii...

  65. [66]

    Anna Kawakami, Jordan Taylor, Sarah Fox, Haiyi Zhu, and Kenneth Holstein. 2026. AI failure loops in devalued work: The confluence of overconfidence in AI and underconfidence in worker expertise. Big Data & Society 13, 1 (2026), 20539517261424164. doi:10.1177/20539517261424164

  66. [67]

    Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, et al. 2024. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models. arXiv preprin...

  67. [68]

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. 2023. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. https://arxiv.org/abs/2305.01569

  68. [69]

    Kevin Knight and Jonathan Graehl. 1998. Machine Transliteration. Computational Linguistics 24, 4 (1998), 599–612. https://aclanthology.org/J98-4003/

  69. [70]

    Elisa Kreiss, Cynthia Bennett, Shayan Hooshmand, Eric Zelikman, Meredith Ringel Morris, and Christopher Potts. 2022. Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics. arXiv preprint arXiv:2205.10646 (2022)

  70. [71]

    Neha Kumar, Naveena Karusala, Azra Ismail, Marisol Wong-Villacres, and Aditya Vishwanath. 2019. Engaging Feminist Solidarity for Comparative Research, Design, and Practice. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 167 (Nov. 2019), 24 pages. doi:10.1145/3359269

  71. [72]

    Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages. https://arxiv.org/abs/2005.00085

  72. [73]

    Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, and Percy Liang. 2023. Holistic Evaluation of Text-To-Image Models. https://arxiv.org/abs/2311.04287

  73. [74]

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. 2025. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. https://arxiv.org/abs/2411.16594

  74. [75]

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. 2024. Evaluating Text-to-Visual Generation with Image-to-Text Generation. https://arxiv.org/abs/2404.01291

  75. [76]

    Kelly Mack, Rai Ching Ling Hsu, Andrés Monroy-Hernández, Brian A. Smith, and Fannie Liu. 2023. Towards Inclusive Avatars: Disability Representation in Avatar Platforms. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 607, 13 pages. do...

  76. [77]

    Kelly Avery Mack, Rida Qadri, Remi Denton, Shaun K Kane, and Cynthia L Bennett. 2024. “They only care to show us the wheelchair”: disability representation in text-to-image AI models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–23

  77. [78]

    Jabez Magomere, Shu Ishida, Tejumade Afonja, Aya Salama, Daniel Kochin, Yuehgoh Foutse, Imane Hamzaoui, Raesetje Sefala, Aisha Alaagib, Samantha Dalal, et al. 2025. The World Wide recipe: A community-centred framework for fine-grained data collection and regional bias operationalisation. In Proceedings of the 2025 ACM Conference on Fairness, Accountabilit...

  78. [79]

    Daniela Massiceti, Camilla Longden, Agnieszka Slowik, Samuel Wills, Martin Grayson, and Cecily Morrison. 2024. Explaining CLIP’s performance disparities on data from blind/low vision users. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12172–12182

  79. [80]

    J. Nathan Matias and Megan Price. 2025. How public involvement can improve the science of AI. Proceedings of the National Academy of Sciences 122, 48 (2025), e2421111122. doi:10.1073/pnas.2421111122

  80. [81]

    Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. 2025. Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. IEEE Transactions on Artificial Intelligence (2025), 1–18. doi:10.1109/tai.2025.3569516

Showing first 80 references.