arxiv: 2606.20676 · v1 · pith:6GQUJ7AYnew · submitted 2026-06-12 · 💻 cs.CV · cs.AI · cs.CL

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

Pith reviewed 2026-06-27 04:54 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords MLLM-as-a-Judge · cultural bias · calibration failure · orientation failure · multimodal benchmark · human agreement · persona prompting

The pith

MLLM judges show two distinct failures on culturally ambiguous images: compressed rating scales and defaulting to one cultural norm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark of 626 image-prompt pairs drawn from U.S. and mainland Chinese contexts in food, fashion, and architecture. Human annotators agree within each cultural pool but disagree across pools on the same items. When six MLLMs judge these pairs, their errors split into a calibration problem that squeezes scores toward the positive end and an orientation problem that favors one cultural reading. Prompting the model with a persona reduces the scale compression but leaves the cultural default intact. This shows that simple prompting cannot turn the models into neutral judges when the underlying human standards themselves diverge.

Core claim

Across six MLLMs the bias decomposes into a positivity-floor calibration failure, in which the model compresses its scale and rarely uses low scores, and an orientation failure, in which the model defaults to one cultural norm. On items where the two human pools disagree, the floor mechanically endorses the more permissive Chinese reading; persona prompting recovers some calibration but the orientation residual remains, indicating the tilt is not reducible to scale compression. Reference-pool demonstrations deepen the orientation residual and inflate high-end scores rather than restoring low-end use. Model origin contributes a small additive tilt of about 0.10 MAE that is approximately invariant under demonstration.
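The two failures can be made concrete with a minimal sketch. The operationalizations below are assumptions chosen for illustration, not the paper's definitions: `low_score_rate` stands in for scale compression, `orientation_tilt` is a simple difference of per-pool MAEs, and the ratings are invented toy data.

```python
from statistics import mean


def low_score_rate(scores, threshold=2):
    """Fraction of ratings at or below `threshold` on a 1-5 scale.

    A judge with a positivity floor would show a rate near zero even
    when the human pools use low scores regularly.
    """
    return sum(s <= threshold for s in scores) / len(scores)


def mae(a, b):
    """Mean absolute error between two equal-length rating lists."""
    return mean(abs(x - y) for x, y in zip(a, b))


def orientation_tilt(judge, pool_us, pool_cn):
    """Hypothetical tilt score: positive means the judge sits closer
    to the Chinese pool than to the U.S. pool."""
    return mae(judge, pool_us) - mae(judge, pool_cn)


# Invented toy ratings for five items (not from the paper).
pool_us = [2, 1, 2, 4, 5]
pool_cn = [4, 4, 5, 4, 5]
judge = [4, 4, 4, 5, 5]

print("judge low-score rate:", low_score_rate(judge))
print("U.S. pool low-score rate:", low_score_rate(pool_us))
print("tilt:", orientation_tilt(judge, pool_us, pool_cn))
```

In this toy case the judge never uses the low end of the scale, and that alone pulls it toward the more permissive pool. That is the mechanical link between the floor and the tilt that the summary describes.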

What carries the argument

The VOIR DIRE benchmark of 626 culturally paired image-prompt artifacts, whose contested items are sampled to split the two annotator pools and thereby isolate cultural effects.

If this is right

  • The positivity floor automatically endorses the more permissive Chinese reading on split items.
  • Persona prompting recovers calibration but leaves a surviving orientation bias.
  • Reference-pool in-context examples increase rather than reduce the orientation residual.
  • Model origin produces an additive tilt of about 0.10 MAE that persists across demonstrations.
  • Alignment metrics must be reported separately against each cultural reference pool.
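The last point, reporting alignment against each pool separately, can be illustrated with a short sketch. The function name `per_pool_report`, the pool labels, and the ratings are all hypothetical. The "merged" entry is included only to show what a single pooled reference would hide.

```python
def mae(a, b):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def per_pool_report(judge, pools):
    """Return the judge's MAE against each pool, plus the MAE against
    the item-wise average of all pools (the 'merged' reference)."""
    report = {name: mae(judge, ratings) for name, ratings in pools.items()}
    merged = [sum(col) / len(col) for col in zip(*pools.values())]
    report["merged"] = mae(judge, merged)
    return report


# Invented two-item example in which the pools disagree completely.
pools = {"us": [1, 5], "cn": [5, 1]}
judge = [5, 1]

report = per_pool_report(judge, pools)
for name, value in report.items():
    print(f"{name}: MAE = {value}")
```

Here the merged reference gives a middling error that describes neither pool: the judge matches one pool exactly and contradicts the other on every item.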

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the two failures are independent, training regimes that only adjust scale behavior will leave cultural defaults untouched.
  • Cross-pool divergence can be treated as an intrinsic property of a judge model rather than noise to be averaged away.
  • The same paired-item design could be applied to other pairs of cultures or to non-visual modalities to test whether the two-failure pattern generalizes.

Load-bearing premise

The two annotator pools are reliable within themselves but genuinely diverge on the same items, and the 626 pairs were chosen so that this divergence isolates cultural disagreement rather than other confounds.

What would settle it

New MLLMs or prompting methods that eliminate the orientation residual on the same contested items while still showing within-pool human reliability.

Figures

Figures reproduced from arXiv: 2606.20676 by Andreas Vlachos, Daniel Lee, Eunkyu Park, Harsh Sharma, Jeonghwan Kim, Kah Mun Chia, Pranav Narayanan Venkit, Shafiq Joty.

Figure 1: Illustrative concept-paired stimuli: each pair shares one evaluative prompt applied to a U.S.-aligned and a …
Figure 2: Mean Q1 Rating by Judge Type and Image Culture (1–5; …
Figure 3: The cross-pool tilt ∆ concentrates on Food and Beverage and on Fashion and Human Characteristics within Chinese Imagery; Architecture and Symbols is near-neutral. (d) Anchors and contested probes. Ambiguity is graded rather than uniform: within-stratum cross-pool Pearson ranges from 0.04 to 0.39. Defining an item as a cultural anchor if both pool consensuses are ≥ 4 (high anchor) or both ≤ 2 (low anchor), …
Figure 4: Persona prompts shift judges asymmetrically. Normalized cultural tilt …
Figure 5: The human annotation interface, shown for a U.S. food item. Annotators see the image and prompt on the …
Figure 6: Three failure modes encountered while authoring prompt–image pairs.
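
The anchor rule quoted in the Figure 3 caption can be sketched in a few lines. This is an illustrative reading only: the function name and the "non-anchor" label are assumptions, and because the caption is truncated, the paper's definition of a contested probe is not reproduced here.

```python
def classify_item(consensus_us, consensus_cn):
    """Label an item from the two pools' consensus ratings (1-5 scale).

    high anchor: both pools at or above 4
    low anchor:  both pools at or below 2
    Anything else is returned as "non-anchor" (a placeholder label,
    not the paper's terminology).
    """
    if consensus_us >= 4 and consensus_cn >= 4:
        return "high anchor"
    if consensus_us <= 2 and consensus_cn <= 2:
        return "low anchor"
    return "non-anchor"


# Invented consensus pairs (U.S. pool, Chinese pool).
examples = [(4.5, 4.0), (1.0, 2.0), (5.0, 1.0), (3.0, 3.0)]
labels = [classify_item(us, cn) for us, cn in examples]
for pair, label in zip(examples, labels):
    print(pair, "->", label)
```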
Original abstract

MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image-prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (α = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r = -0.12). Across six MLLMs, the bias decomposes into two failures: a positivity-floor calibration failure (compressed scale use) and an orientation failure (default to one cultural norm). On this corpus, where contested items are sampled to split the two pools, the floor mechanically validates the more-permissive Chinese reading; persona prompting partially recovers calibration, but the orientation residual survives, evidence the tilt is not reducible to scale compression. Reference-pool in-context demonstrations deepen the orientation residual and inflate the high end rather than restoring use of the low end. Model origin adds a small additive tilt (~0.10 MAE) that is approximately invariant under demonstration. We recommend reporting alignment against each reference pool separately and treating cross-pool divergence as a judge property.
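
As a rough illustration of what a cross-pool correlation such as the reported Q1 r = -0.12 measures, the sketch below computes a Pearson coefficient between two pools' per-item consensus ratings. The ratings are invented, and the choice to correlate per-item consensus values is an assumption, since the abstract does not say how r is computed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented per-item consensus ratings for the two pools.
pool_us = [1, 2, 3, 4, 5]
pool_cn = [5, 4, 3, 2, 1]

r_cross = pearson(pool_us, pool_cn)
print("cross-pool r:", round(r_cross, 3))
```

A value near zero or below, as in the abstract, means that knowing one pool's rating of an item tells you little or nothing about the other pool's rating. In that situation "agreement with humans" has no single reference.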

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the VOIR DIRE benchmark of 626 culturally paired image-prompt artifacts spanning U.S. and mainland Chinese contexts in food, fashion, and architecture. Human annotator pools are within-pool reliable (α = 0.86/0.74) but cross-pool divergent (Q1 r = -0.12). Across six MLLMs the observed bias is decomposed into a positivity-floor calibration failure (compressed scale use) and an orientation failure (default to one cultural norm). On contested items sampled to split the pools, the floor validates the more-permissive Chinese reading; persona prompting partially recovers calibration but the orientation residual survives, indicating the tilt is not reducible to scale compression. Reference-pool in-context demonstrations deepen the orientation residual. Model origin adds a small additive tilt (~0.10 MAE) invariant under demonstration. The paper recommends reporting alignment against each reference pool separately.

Significance. If the decomposition and isolation of cultural effects hold, the work is significant for MLLM evaluation: it supplies a concrete benchmark, shows that agreement-based validation is undefined under cultural heterogeneity, and demonstrates that orientation bias persists beyond persona prompting. The reported within-pool reliabilities and the explicit separation of calibration versus orientation failures provide a useful framework and falsifiable testbed for future judge alignment studies.

major comments (1)
  1. [Benchmark construction and sampling procedure] Benchmark construction (data sampling procedure): the selection of the 626 items and the contested subset chosen to split the two annotator pools does not report explicit controls, matching, or stratification for non-cultural item properties (visual complexity, prompt phrasing, or domain-specific ambiguity within food/fashion/architecture). The within-pool alphas establish reliability but do not address whether pool divergence (r = -0.12) is attributable solely to cultural norms; absent such controls the surviving orientation residual after persona prompting could reflect residual item confounds rather than a distinct cultural-orientation mechanism, which is load-bearing for the central decomposition claim.
minor comments (2)
  1. [Abstract] Abstract: the phrases 'positivity-floor calibration failure' and 'orientation failure' are used without a one-sentence operational definition; a brief parenthetical gloss would improve immediate readability.
  2. [Results] Results section: the reported MAE additive tilt of ~0.10 from model origin would benefit from an explicit statement of the baseline against which it is measured and the number of items contributing to that statistic.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Benchmark construction and sampling procedure] Benchmark construction (data sampling procedure): the selection of the 626 items and the contested subset chosen to split the two annotator pools does not report explicit controls, matching, or stratification for non-cultural item properties (visual complexity, prompt phrasing, or domain-specific ambiguity within food/fashion/architecture). The within-pool alphas establish reliability but do not address whether pool divergence (r = -0.12) is attributable solely to cultural norms; absent such controls the surviving orientation residual after persona prompting could reflect residual item confounds rather than a distinct cultural-orientation mechanism, which is load-bearing for the central decomposition claim.

    Authors: We agree that the manuscript does not report explicit controls, matching, or stratification on non-cultural properties such as visual complexity or prompt phrasing. Item selection was performed via cultural pairing of U.S. and mainland Chinese contexts, with the contested subset defined post hoc by the observed human-pool divergence (r = -0.12) so that the selected items split the annotator pools. This design treats cultural norm divergence as the primary axis, but we acknowledge that unmeasured item-level confounds could contribute to the observed orientation residual. In revision we will add a limitations subsection that explicitly discusses this gap, notes the absence of stratification, and clarifies that the calibration-versus-orientation decomposition is demonstrated on the given corpus rather than proven free of all confounds. The core empirical finding, that persona prompting recovers scale use but leaves an orientation residual, remains intact on the reported data. revision: partial

Circularity Check

0 steps flagged

No significant circularity; new benchmark and external human pools are independent

full rationale

The paper constructs the VOIR DIRE benchmark of 626 items and measures within-pool reliability (α=0.86/0.74) plus cross-pool divergence (r=-0.12) as direct empirical observations on human annotators. Model bias decomposition into positivity-floor calibration failure and orientation failure follows from explicit comparisons against these external pools, with persona prompting tested as an intervention. No equations, fitted parameters, or self-citations are invoked to define the target quantities; the sampling procedure and pool divergence are presented as measured inputs rather than derived outputs. The central claims rest on new data collection and evaluation against independent human references, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the new construction of the VOIR DIRE benchmark and the reported properties of the human annotator pools; no free parameters, axioms, or invented entities are introduced beyond standard empirical measurement assumptions.


