
arxiv: 2406.14194 · v3 · submitted 2024-06-20 · 💻 cs.CV · cs.AI

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

Pith reviewed 2026-05-23 23:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords bias benchmark · vision-language models · social bias evaluation · intersectional bias · large-scale dataset · open-ended questions · closed-ended questions · model bias assessment

The pith

VLBiasBench supplies a dataset of 128k samples across eleven bias categories to evaluate social biases in large vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to overcome the narrow scope of earlier benchmarks by building a much larger collection of image-question pairs that span nine social bias types plus two intersectional ones. It generates 46,848 images and pairs them with both open-ended and closed-ended questions to reach 128,342 total samples. The resulting benchmark is then run on seventeen models, producing observations about how those models handle the targeted biases. A reader would care because vision-language systems are moving into widespread use and unchecked bias in their outputs can affect decisions in hiring, policing, and daily information access.

Core claim

VLBiasBench is a benchmark that covers nine social bias categories (age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status) and two intersectional categories (race x gender, race x social economic status), constructed from 46,848 Stable Diffusion XL images combined with open- and closed-ended questions into 128,342 samples, and applied to fifteen open-source plus two closed-source models to produce new observations on bias patterns.

What carries the argument

The VLBiasBench dataset of generated images paired with bias-directed questions in both open-ended and closed-ended formats.

If this is right

  • Evaluations can now address nine social biases plus two intersectional combinations in a single resource.
  • Both open-ended and closed-ended questions supply complementary views on how models express bias.
  • Results from seventeen models supply concrete observations about bias levels in current open-source and closed-source systems.
  • The scale of 128,342 samples supports statistical comparisons across bias types and model families.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could use the same image-question structure to test whether fine-tuning or prompting strategies reduce the measured biases.
  • The benchmark format might extend to other multimodal systems that process both images and text.
  • Large-scale category coverage makes it feasible to study whether bias strength varies systematically with model size or training data.

Load-bearing premise

Images produced by Stable Diffusion XL accurately and neutrally represent the intended social categories so that measured biases can be attributed to the vision-language models rather than to artifacts in the images themselves.

What would settle it

Re-running the evaluations with real photographs instead of generated images and obtaining materially different bias scores would indicate that the original measurements depend on generation artifacts rather than the targeted social categories.
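
As a rough illustration of that check, the sketch below compares per-category bias scores obtained on generated images with scores obtained on real photographs, using a percentile bootstrap over the paired differences. Everything in it is an assumption made for illustration: the function name `bootstrap_mean_diff`, the example scores, and the choice of a bootstrap are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def bootstrap_mean_diff(generated, real, n_resamples=2000, seed=0):
    """Return (mean_diff, ci_low, ci_high) for generated - real.

    `generated` and `real` are paired lists of bias scores, one pair per
    bias category (hypothetical inputs). The interval is an approximate
    95% percentile bootstrap interval on the mean paired difference.
    """
    if len(generated) != len(real) or not generated:
        raise ValueError("scores must be non-empty and paired per category")
    diffs = [g - r for g, r in zip(generated, real)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and sort the means.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    ci_low = resampled[int(0.025 * n_resamples)]
    ci_high = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), ci_low, ci_high


if __name__ == "__main__":
    # Made-up scores for four categories, for demonstration only.
    generated_scores = [0.42, 0.31, 0.55, 0.27]
    real_scores = [0.40, 0.35, 0.50, 0.30]
    print(bootstrap_mean_diff(generated_scores, real_scores))
```

An interval that clearly excludes zero would point toward a dependence on generation artifacts; an interval that straddles zero would not by itself prove the two image sources are equivalent.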

Figures

Figures reproduced from arXiv: 2406.14194 by Jie Zhang, Shiguang Shan, Sibo Wang, Wen Gao, Xiangkui Cao, Xilin Chen, Zheng Yuan.

Figure 1: Framework of synthetic image generation (top), along with specific examples of evaluations for open-ended and close-ended questions (bottom left
Figure 2: Prompt library construction for open-ended question evaluation. (a) combination-based construction. (b) automatical construction.
Figure 3: Construction of the “Base” and “Scene” datasets. (a) shows an ambiguous sample from the “Base” dataset, where the context is dominated by textual
Figure 4: Text-induced dataset construction for close-ended question evaluation. We first extract prompts from the corpus to generate images. Attribute information
Figure 5: Close-ended evaluation framework. We contrast text-dominant con
Figure 6: Example for stereotype-based bias. The model’s response stereotypes elderly people as being unfamiliar with using smartphones.
Figure 7: Example for illusion-based bias. The model’s response includes information that never appears in the image and context.
Figure 8: Open-ended evaluation on race bias. Otter exhibits bias against Africans (assuming him to be criminal), while LLaVA and Gemini provide relatively
Figure 9: Open-ended evaluation on gender bias. Otter shows bias against female (assuming her to be a thief). In contrast, LLaVA and Gemini does not show
Figure 10: Open-ended evaluation on religious bias.
Figure 11: Open-ended evaluation on profession bias related to gender. Otter links programmer with male, whereas Gemini do not establish such gender-based
Figure 12: Close-ended evaluation on race (ambiguous example from “Base” subset).
Figure 13: Close-ended evaluation on race x SES (ambiguous example from “Scene” subset).
Figure 14: Close-ended evaluation on appearance (ambiguous example from “Scene Text” subset).
Figure 15: Close-ended evaluation on SES (disambiguated example from “Base” subset).
Original abstract

The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmark designed to evaluate biases in LVLMs. VLBiasBench, features a dataset that covers nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status, as well as two intersectional bias categories: race x gender and race x social economic status. To build a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with various questions to creat 128,342 samples. These questions are divided into open-ended and close-ended types, ensuring thorough consideration of bias sources and a comprehensive evaluation of LVLM biases from multiple perspectives. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VLBiasBench, a benchmark for evaluating social biases in LVLMs. It covers nine bias categories (age, disability, gender, nationality, physical appearance, race, religion, profession, socioeconomic status) plus two intersectional ones (race×gender, race×socioeconomic status), generated via Stable Diffusion XL to produce 46,848 images and 128,342 question-image samples (open- and closed-ended), and reports evaluations on 17 models yielding new insights into LVLM biases.

Significance. If the generated images can be shown to accurately and cleanly instantiate the target attributes without confounding artifacts, the benchmark would meaningfully expand prior work by its scale, breadth of categories, and dual question formats, providing a reusable resource for systematic LVLM bias measurement.

major comments (2)
  1. [Dataset construction] Dataset construction (abstract and § on image generation): the central claim that model responses can be attributed to the nine labeled bias categories (plus intersections) rests on the assumption that the 46,848 SDXL-generated images accurately depict the intended attributes without systematic visual confounders; no quantitative validation (human agreement rates, attribute-classifier accuracy on held-out images, or prompt-ablation results) is reported.
  2. [Evaluation methodology] Evaluation and scoring methodology (abstract and results section): the paper states that evaluations on 15 open-source + 2 closed-source models 'yield new insights,' yet provides no description of the bias-scoring procedure for open-ended responses, inter-annotator agreement, or how closed-ended accuracy is aggregated into bias metrics; this makes the reliability of the reported insights impossible to assess.
minor comments (2)
  1. [Abstract] Abstract contains a typo: 'creat 128,342 samples' should be 'create 128,342 samples.'
  2. [Abstract] The abstract states '15 open-source models as well as two advanced closed-source models' (total 17) but the title and introduction should consistently report the exact model count and list the models evaluated.
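
The image validation requested in major comment 1 could be reported as a raw agreement rate plus a chance-corrected statistic. The sketch below computes both for intended attribute labels versus one human annotator's labels. It is a minimal illustration under stated assumptions: the function name `agreement_and_kappa`, the label values, and the use of Cohen's kappa are hypothetical choices, not a procedure described in the paper.

```python
import math
from collections import Counter


def agreement_and_kappa(intended, annotated):
    """Return (agreement_rate, cohens_kappa) for two equal-length label lists.

    `intended` holds the attribute each image was prompted to show and
    `annotated` holds what a human annotator says it shows (hypothetical data).
    """
    if len(intended) != len(annotated) or not intended:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(intended)
    observed = sum(a == b for a, b in zip(intended, annotated)) / n
    left, right = Counter(intended), Counter(annotated)
    # Agreement expected by chance if the two label sources were independent.
    expected = sum(left[label] * right[label] for label in left) / (n * n)
    if expected == 1.0:
        # Both sources use one identical label throughout; kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up labels for four images, for demonstration only.
    prompted = ["elderly", "elderly", "young", "young"]
    judged = ["elderly", "elderly", "young", "elderly"]
    print(agreement_and_kappa(prompted, judged))
```

Reporting kappa next to the raw rate matters when one attribute dominates a category, because a high raw agreement can then arise largely by chance.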

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on VLBiasBench. The comments identify areas where additional detail would strengthen the paper, and we address each point below with plans for revision.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction (abstract and § on image generation): the central claim that model responses can be attributed to the nine labeled bias categories (plus intersections) rests on the assumption that the 46,848 SDXL-generated images accurately depict the intended attributes without systematic visual confounders; no quantitative validation (human agreement rates, attribute-classifier accuracy on held-out images, or prompt-ablation results) is reported.

    Authors: We agree that quantitative validation of the generated images is necessary to support clean attribution of model responses to the target bias categories. The manuscript describes the use of attribute-specific prompts with SDXL but does not report human agreement rates, classifier accuracy, or ablation results. In the revised version we will add a validation subsection that includes human evaluation on a held-out sample of images (reporting agreement rates with intended attributes) and any available automated checks. This directly addresses the concern. revision: yes

  2. Referee: [Evaluation methodology] Evaluation and scoring methodology (abstract and results section): the paper states that evaluations on 15 open-source + 2 closed-source models 'yield new insights,' yet provides no description of the bias-scoring procedure for open-ended responses, inter-annotator agreement, or how closed-ended accuracy is aggregated into bias metrics; this makes the reliability of the reported insights impossible to assess.

    Authors: We acknowledge that the absence of a detailed scoring description limits assessment of the reported insights. The current manuscript presents evaluation results without specifying the exact procedure for scoring open-ended responses, any inter-annotator agreement, or the aggregation of closed-ended accuracies into bias metrics. We will expand the evaluation section in the revision to include a full description of the bias-scoring pipeline for both question types, the method used to derive the metrics, and any agreement statistics if human judgment was involved. revision: yes
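
One simple way the closed-ended aggregation could be made explicit is to report accuracy per bias category, split by context type (ambiguous versus disambiguated, as in the figure examples, where the correct answer to an ambiguous item can be "Cannot answer"). The sketch below does only that. The record fields, the function name `accuracy_by_group`, and the example rows are hypothetical, and this is not the paper's bias metric.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(category, context): accuracy} for closed-ended answers.

    Each record is a dict with the hypothetical keys "category",
    "context" ("ambiguous" or "disambiguated"), "answer", and "correct".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        key = (record["category"], record["context"])
        totals[key] += 1
        if record["answer"] == record["correct"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Made-up model outputs, for demonstration only.
    rows = [
        {"category": "race", "context": "ambiguous",
         "answer": "Cannot answer", "correct": "Cannot answer"},
        {"category": "race", "context": "ambiguous",
         "answer": "Option A", "correct": "Cannot answer"},
        {"category": "race", "context": "disambiguated",
         "answer": "Option B", "correct": "Option B"},
    ]
    for key, accuracy in sorted(accuracy_by_group(rows).items()):
        print(key, accuracy)
```

Low accuracy on ambiguous items means the model picked a concrete option when the evidence did not support one, which is where a stereotype-driven choice would show up; turning that into a single bias score would need a further, explicitly stated rule.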

Circularity Check

0 steps flagged

No circularity: benchmark construction and evaluations are independent

full rationale

The paper constructs VLBiasBench by generating images via an external model (Stable Diffusion XL) and pairing them with questions across bias categories, then reports evaluations on 17 LVLMs. No equations, fitted parameters, predictions, or derivations appear in the provided text. No self-citations are invoked as load-bearing support for any uniqueness claim or ansatz. The central contribution (dataset scale and coverage) does not reduce to its own inputs by definition or statistical forcing, satisfying the criteria for a self-contained benchmark paper with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central construction step depends on one domain assumption about image generation fidelity; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Stable Diffusion XL generates images that faithfully represent the nine social bias categories without introducing confounding visual artifacts or biases.
    Invoked to justify the 46,848-image dataset as a valid testbed for model bias.




Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Causal Bias Detection in Generative Artificial Intelligence

    cs.AI 2026-05 unverdicted novelty 7.0

    Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.

  2. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.

  3. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.

  4. Causal Bias Detection in Generative Artificial Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 2 Pith papers · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

  2. [2]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592 , 2023

  3. [3]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” in Advances in neural information processing systems (NIPS) , vol. 36, 2024

  4. [4]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023) , vol. 2, no. 3, p. 6, 2023

  5. [5]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  6. [6]

    Are gender-neutral queries really gender-neutral? mitigating gender bias in image search,

    J. Wang, Y . Liu, and X. E. Wang, “Are gender-neutral queries really gender-neutral? mitigating gender bias in image search,” arXiv preprint arXiv:2109.05433, 2021

  7. [7]

    Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings

    T. Manzini, Y . C. Lim, Y . Tsvetkov, and A. W. Black, “Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings,” arXiv preprint arXiv:1904.04047 , 2019

  8. [8]

    Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions,

    H. Zhang, W. Shao, H. Liu, Y . Ma, P. Luo, Y . Qiao, and K. Zhang, “Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions,”arXiv preprint arXiv:2403.09346, 2024

  9. [9]

    Assessment of multimodal large language models in alignment with human values,

    Z. Shi, Z. Wang, H. Fan, Z. Zhang, L. Li, Y . Zhang, Z. Yin, L. Sheng, Y . Qiao, and J. Shao, “Assessment of multimodal large language models in alignment with human values,” arXiv preprint arXiv:2403.17830 , 2024

  10. [10]

    Bold: Dataset and metrics for measuring biases in open-ended language generation,

    J. Dhamala, T. Sun, V . Kumar, S. Krishna, Y . Pruksachatkun, K.-W. Chang, and R. Gupta, “Bold: Dataset and metrics for measuring biases in open-ended language generation,” in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (ACM FAccT) , 2021, pp. 862–872

  11. [11]

    Bbq: A hand-built bias benchmark for question answering,

    A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “Bbq: A hand-built bias benchmark for question answering,” in Findings of the Association for Computational Linguistics: ACL, 2022

  12. [12]

    Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation,

    K. Karkkainen and J. Joo, “Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation,” in Proceed- ings of the IEEE/CVF winter conference on applications of computer vision (WACV), 2021, pp. 1548–1558

  13. [13]

    Age progression/regression by con- ditional adversarial autoencoder,

    Z. Zhang, Y . Song, and H. Qi, “Age progression/regression by con- ditional adversarial autoencoder,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) , 2017, pp. 5810–5818

  14. [14]

    Microsoft coco: Common objects in context,

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755

  15. [15]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,

    B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hocken- maier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE/CVF international conference on computer vision (CVPR) , 2015, pp. 2641–2649

  16. [16]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning (ICML). PMLR, 2021, pp. 8748–8763

  17. [17]

    Measuring and mitigating bias in vision-and- language models,

    F. Chen and Z.-Y . Dou, “Measuring and mitigating bias in vision-and- language models,” Information Processing & Management , 2024

  18. [18]

    Leveraging clip for inferring sen- sitive information and improving model fairness,

    M. Zhang and R. Chunara, “Leveraging clip for inferring sen- sitive information and improving model fairness,” arXiv preprint arXiv:2403.10624, 2024

  19. [19]

    Mitigating test-time bias for fair image retrieval,

    F. Kong, S. Yuan, W. Hao, and R. Henao, “Mitigating test-time bias for fair image retrieval,” Advances in Neural Information Processing Systems (NIPS), vol. 36, 2024

  20. [20]

    Dear: Debiasing vision-language models with additive residuals,

    A. Seth, M. Hemani, and C. Agarwal, “Dear: Debiasing vision-language models with additive residuals,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6820–6829

  21. [21]

    Multimodal bias: Introducing a framework for stereotypical bias assessment beyond gender and race in vision language models,

    S. Janghorbani and G. De Melo, “Multimodal bias: Introducing a framework for stereotypical bias assessment beyond gender and race in vision language models,” arXiv preprint arXiv:2303.12734 , 2023

  22. [22]

    Vlstereoset: A study of stereotypical bias in pre-trained vision-language models,

    K. Zhou, Y . LAI, and J. Jiang, “Vlstereoset: A study of stereotypical bias in pre-trained vision-language models,” in Annual Meeting of the Association for Computational Linguistics (ACL) , 2022

  23. [23]

    Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution,

    S. M. Hall, F. Gonc ¸alves Abrantes, H. Zhu, G. Sodunke, A. Shtedritski, and H. R. Kirk, “Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution,” Advances in Neural Information Processing Systems (NIPS) , vol. 36, 2024

  24. [24]

    Socialcounterfactuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples,

    P. Howard, A. Madasu, T. Le, G. L. Moreno, A. Bhiwandiwalla, and V . Lal, “Socialcounterfactuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 11 975–11 985

  25. [25]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 , 2020

  26. [26]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2024, pp. 26 296– 26 306. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13

  27. [27]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805 , 2023

  28. [28]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952 , 2023

  29. [29]

    Vision-language models for vision tasks: A survey,

    J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2024

  30. [30]

    Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” in International conference on machine learning (ICML) . PMLR, 2023, pp. 19 730–19 742

  31. [31]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    J. Chen, D. Zhu, X. Shen, X. Li, Z. Liu, P. Zhang, R. Krishnamoorthi, V . Chandra, Y . Xiong, and M. Elhoseiny, “Minigpt-v2: large language model as a unified interface for vision-language multi-task learning,” arXiv preprint arXiv:2310.09478 , 2023

  32. [32]

    Mimic-it: Multi-modal in-context instruction tuning,

    B. Li, Y . Zhang, L. Chen, J. Wang, F. Pu, J. Yang, C. Li, and Z. Liu, “Mimic-it: Multi-modal in-context instruction tuning,” arXiv preprint arXiv:2306.05425, 2023

  33. [33]

    Instructblip: Towards general-purpose vision- language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision- language models with instruction tuning,” Advances in Neural Informa- tion Processing Systems (NIPS) , vol. 36, 2024

  34. [34]

    InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

    P. Zhang, X. D. B. Wang, Y . Cao, C. Xu, L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan, H. Yan et al. , “Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition,” arXiv preprint arXiv:2309.15112 , 2023

  35. [35]

    Deep learning face attributes in the wild,

    Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE international conference on computer vision (ICCV) , 2015, pp. 3730–3738

  36. [36]

    The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world,

    W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V . J. Reddi, and C. Coleman, “The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NIPS), 2022

  37. [37]

    Unequal representation and gender stereotypes in image search results for occupations,

    M. Kay, C. Matuszek, and S. A. Munson, “Unequal representation and gender stereotypes in image search results for occupations,” in Proceedings of the 33rd annual acm conference on human factors in computing systems, 2015, pp. 3819–3828

  38. [38]

    Implicit diversity in image summarization,

    L. E. Celis and V . Keswani, “Implicit diversity in image summarization,” Proceedings of the ACM on Human-Computer Interaction (HCI) , vol. 4, no. CSCW2, pp. 1–28, 2020

  39. [39]

    Fairclip: Harnessing fairness in vision-language learning,

    Y . Luo, M. Shi, M. O. Khan, M. M. Afzal, H. Huang, S. Yuan, Y . Tian, L. Song, A. Kouhana, T. Elze et al. , “Fairclip: Harnessing fairness in vision-language learning,” arXiv preprint arXiv:2403.19949 , 2024

  40. [40]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    L. Chen, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, J. Wang, Y . Qiao, D. Lin et al. , “Are we on the right way for evaluating large vision-language models?” arXiv preprint arXiv:2403.20330 , 2024

  41. [41]

    Improving image generation with better captions,

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with better captions,” Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, vol. 2, no. 3, p. 8, 2023

  42. [42]

    A systematic survey on deep generative models for graph generation,

    X. Guo and L. Zhao, “A systematic survey on deep generative models for graph generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 45, no. 5, pp. 5370–5390, 2022

  43. [43]

    Fast graph generation via spectral diffu- sion,

    T. Luo, Z. Mo, and S. J. Pan, “Fast graph generation via spectral diffu- sion,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2023

  44. [44]

    Local and global gans with semantic-aware upsampling for image generation,

    H. Tang, L. Shao, P. H. Torr, and N. Sebe, “Local and global gans with semantic-aware upsampling for image generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) , vol. 45, no. 1, pp. 768–784, 2022

  45. [45]

    Reinforcing generated images via meta-learning for one-shot fine-grained visual recognition,

    S. Tsutsui, Y . Fu, and D. Crandall, “Reinforcing generated images via meta-learning for one-shot fine-grained visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) , vol. 46, no. 3, pp. 1455–1463, 2022

  46. [46]

    Fake it till you make it: Learning transferable representations from synthetic imagenet clones,

    M. B. Sarıyıldız, K. Alahari, D. Larlus, and Y . Kalantidis, “Fake it till you make it: Learning transferable representations from synthetic imagenet clones,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2023, pp. 8011– 8021

  47. [47]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition (CVPR) . Ieee, 2009, pp. 248– 255

  48. [48]

    Stereoset: Measuring stereotypical bias in pretrained language models,

    M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotypical bias in pretrained language models,” arXiv preprint arXiv:2004.09456, 2020

  49. [49]

    A unified framework and dataset for assessing gender bias in vision-language models,

    A. Sathe, P. Jain, and S. Sitaram, “A unified framework and dataset for assessing gender bias in vision-language models,” arXiv preprint arXiv:2402.13636, 2024

  50. [50]

    Seed-bench: Benchmarking multimodal llms with generative comprehension,

    B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  51. [51]

    Mmbench: Is your multi-modal model an all-around player?

    Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu et al., “Mmbench: Is your multi-modal model an all-around player?” in European Conference on Computer Vision (ECCV). Springer, 2025, pp. 216–233

  52. [52]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

    P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, F. Meng, S. Huang, Y. Qiao, and P. Luo, “Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models,” arXiv preprint arXiv:2306.09265, 2023

  53. [53]

    Fine-grained sentiment classification using bert,

    M. Munikar, S. Shakya, and A. Shrestha, “Fine-grained sentiment classification using bert,” in 2019 Artificial Intelligence for Transforming Business and Society (AITB), vol. 1. IEEE, 2019, pp. 1–5

  54. [54]

    Vader: A parsimonious rule-based model for sentiment analysis of social media text,

    C. Hutto and E. Gilbert, “Vader: A parsimonious rule-based model for sentiment analysis of social media text,” in Proceedings of the international AAAI conference on web and social media (ICWSM), vol. 8, no. 1, 2014, pp. 216–225

  55. [55]

    Red-teaming the stable diffusion safety filter,

    J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tramèr, “Red-teaming the stable diffusion safety filter,” arXiv preprint arXiv:2210.04610, 2022

  56. [56]

    Erasing concepts from diffusion models,

    R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing concepts from diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2426–2436

  57. [57]

    Mace: Mass concept erasure in diffusion models,

    S. Lu, Z. Wang, L. Li, Y. Liu, and A. W.-K. Kong, “Mace: Mass concept erasure in diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6430–6440

  58. [58]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond,” arXiv preprint arXiv:2308.12966, vol. 1, no. 2, p. 3, 2023

  59. [59]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao, “Shikra: Unleashing multimodal llm’s referential dialogue magic,” arXiv preprint arXiv:2306.15195, 2023

  60. [60]

    The proof and measurement of association between two things,

    C. Spearman, “The proof and measurement of association between two things,” The American journal of psychology, vol. 100, no. 3/4, pp. 441– 471, 1987

  61. [61]

    A new measure of rank correlation,

    M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1-2, pp. 81–93, 1938

  62. [62]

    Leveraging large language models for multiple choice question answering,

    J. Robinson, C. M. Rytting, and D. Wingate, “Leveraging large language models for multiple choice question answering,” in The Twelfth International Conference on Learning Representations (ICLR), 2023

  63. [63]

    Large language models are not robust multiple choice selectors,

    C. Zheng, H. Zhou, F. Meng, J. Zhou, and M. Huang, “Large language models are not robust multiple choice selectors,” in The Twelfth International Conference on Learning Representations (ICLR), 2023

    SUPPLEMENTARY MATERIAL I. DISCUSSION A. Synthetic Dataset for training / fine-tuning In the f...

  64. [64]

    yes” (shown in Tab. VII as “yes pr

    under BSD 3-Clause license, LLaVA [26] under Apache-2.0 license, Otter [32] under MIT license, Shikra [59] under CC-BY-NC license, internlm-xcomposer [34] under Apache-2.0 license, qwen [58] under Tongyi Qianwen license). For the diffusion model, SDXL [28] is under MIT License. C. Causes of Bias Based on our close-ended dataset, we investigate the mod-...

  65. [65]

    draw style

    No, it is not suitable for the person in the picture to use the phone while driving. Fig. 7. Example for illusion-based bias. The model’s response includes information that never appears in the image and context. • Profession: Metalworking occupations, Sewing occupations, Healthcare occupation...