VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
Pith reviewed 2026-05-23 23:35 UTC · model grok-4.3
The pith
VLBiasBench supplies a dataset of 128k samples across eleven bias categories to evaluate social biases in large vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLBiasBench is a benchmark that covers nine social bias categories (age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status) and two intersectional categories (race x gender, race x social economic status), constructed from 46,848 Stable Diffusion XL images combined with open- and closed-ended questions into 128,342 samples, and applied to fifteen open-source plus two closed-source models to produce new observations on bias patterns.
What carries the argument
The VLBiasBench dataset of generated images paired with bias-directed questions in both open-ended and closed-ended formats.
If this is right
- Evaluations can now address nine social biases plus two intersectional combinations in a single resource.
- Both open-ended and closed-ended questions supply complementary views on how models express bias.
- Results from seventeen models supply concrete observations about bias levels in current open-source and closed-source systems.
- The scale of 128,342 samples supports statistical comparisons across bias types and model families.
Where Pith is reading between the lines
- Future work could use the same image-question structure to test whether fine-tuning or prompting strategies reduce the measured biases.
- The benchmark format might extend to other multimodal systems that process both images and text.
- Large-scale category coverage makes it feasible to study whether bias strength varies systematically with model size or training data.
Load-bearing premise
Images produced by Stable Diffusion XL accurately and neutrally represent the intended social categories so that measured biases can be attributed to the vision-language models rather than to artifacts in the images themselves.
What would settle it
Re-running the evaluations with real photographs instead of generated images and obtaining materially different bias scores would indicate that the original measurements depend on generation artifacts rather than the targeted social categories.
Figures
read the original abstract
The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmark designed to evaluate biases in LVLMs. VLBiasBench, features a dataset that covers nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status, as well as two intersectional bias categories: race x gender and race x social economic status. To build a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with various questions to creat 128,342 samples. These questions are divided into open-ended and close-ended types, ensuring thorough consideration of bias sources and a comprehensive evaluation of LVLM biases from multiple perspectives. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLBiasBench, a benchmark for evaluating social biases in LVLMs. It covers nine bias categories (age, disability, gender, nationality, physical appearance, race, religion, profession, socioeconomic status) plus two intersectional ones (race×gender, race×socioeconomic status), generated via Stable Diffusion XL to produce 46,848 images and 128,342 question-image samples (open- and closed-ended), and reports evaluations on 17 models yielding new insights into LVLM biases.
Significance. If the generated images can be shown to accurately and cleanly instantiate the target attributes without confounding artifacts, the benchmark would meaningfully expand prior work by its scale, breadth of categories, and dual question formats, providing a reusable resource for systematic LVLM bias measurement.
major comments (2)
- [Dataset construction] Dataset construction (abstract and § on image generation): the central claim that model responses can be attributed to the nine labeled bias categories (plus intersections) rests on the assumption that the 46,848 SDXL-generated images accurately depict the intended attributes without systematic visual confounders; no quantitative validation (human agreement rates, attribute-classifier accuracy on held-out images, or prompt-ablation results) is reported.
- [Evaluation methodology] Evaluation and scoring methodology (abstract and results section): the paper states that evaluations on 15 open-source + 2 closed-source models 'yield new insights,' yet provides no description of the bias-scoring procedure for open-ended responses, inter-annotator agreement, or how closed-ended accuracy is aggregated into bias metrics; this makes the reliability of the reported insights impossible to assess.
minor comments (2)
- [Abstract] Abstract contains a typo: 'creat 128,342 samples' should be 'create 128,342 samples.'
- [Abstract] The abstract states '15 open-source models as well as two advanced closed-source models' (total 17) but the title and introduction should consistently report the exact model count and list the models evaluated.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on VLBiasBench. The comments identify areas where additional detail would strengthen the paper, and we address each point below with plans for revision.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction (abstract and § on image generation): the central claim that model responses can be attributed to the nine labeled bias categories (plus intersections) rests on the assumption that the 46,848 SDXL-generated images accurately depict the intended attributes without systematic visual confounders; no quantitative validation (human agreement rates, attribute-classifier accuracy on held-out images, or prompt-ablation results) is reported.
Authors: We agree that quantitative validation of the generated images is necessary to support clean attribution of model responses to the target bias categories. The manuscript describes the use of attribute-specific prompts with SDXL but does not report human agreement rates, classifier accuracy, or ablation results. In the revised version we will add a validation subsection that includes human evaluation on a held-out sample of images (reporting agreement rates with intended attributes) and any available automated checks. This directly addresses the concern. revision: yes
-
Referee: [Evaluation methodology] Evaluation and scoring methodology (abstract and results section): the paper states that evaluations on 15 open-source + 2 closed-source models 'yield new insights,' yet provides no description of the bias-scoring procedure for open-ended responses, inter-annotator agreement, or how closed-ended accuracy is aggregated into bias metrics; this makes the reliability of the reported insights impossible to assess.
Authors: We acknowledge that the absence of a detailed scoring description limits assessment of the reported insights. The current manuscript presents evaluation results without specifying the exact procedure for scoring open-ended responses, any inter-annotator agreement, or the aggregation of closed-ended accuracies into bias metrics. We will expand the evaluation section in the revision to include a full description of the bias-scoring pipeline for both question types, the method used to derive the metrics, and any agreement statistics if human judgment was involved. revision: yes
Circularity Check
No circularity: benchmark construction and evaluations are independent
full rationale
The paper constructs VLBiasBench by generating images via an external model (Stable Diffusion XL) and pairing them with questions across bias categories, then reports evaluations on 17 LVLMs. No equations, fitted parameters, predictions, or derivations appear in the provided text. No self-citations are invoked as load-bearing support for any uniqueness claim or ansatz. The central contribution (dataset scale and coverage) does not reduce to its own inputs by definition or statistical forcing, satisfying the criteria for a self-contained benchmark paper with score 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Stable Diffusion XL generates images that faithfully represent the nine social bias categories without introducing confounding visual artifacts or biases.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use Stable Diffusion XL model to generate 46,848 high-quality images... combined with various questions to create 128,342 samples... open-ended and close-ended types
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We evaluate... using the Valence Aware Dictionary and Sentiment Reasoner (VADER) to calculate the text’s sentiment score... Range VADER
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
Causal Bias Detection in Generative Artificial Intelligence
Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
-
Causal Bias Detection in Generative Artificial Intelligence
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
Reference graph
Works this paper leans on
-
[1]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” in Advances in neural information processing systems (NIPS) , vol. 36, 2024
work page 2024
-
[4]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,
W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023) , vol. 2, no. 3, p. 6, 2023
work page 2023
-
[5]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Are gender-neutral queries really gender-neutral? mitigating gender bias in image search,
J. Wang, Y . Liu, and X. E. Wang, “Are gender-neutral queries really gender-neutral? mitigating gender bias in image search,” arXiv preprint arXiv:2109.05433, 2021
-
[7]
T. Manzini, Y . C. Lim, Y . Tsvetkov, and A. W. Black, “Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings,” arXiv preprint arXiv:1904.04047 , 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[8]
H. Zhang, W. Shao, H. Liu, Y . Ma, P. Luo, Y . Qiao, and K. Zhang, “Avibench: Towards evaluating the robustness of large vision-language model on adversarial visual-instructions,”arXiv preprint arXiv:2403.09346, 2024
-
[9]
Assessment of multimodal large language models in alignment with human values,
Z. Shi, Z. Wang, H. Fan, Z. Zhang, L. Li, Y . Zhang, Z. Yin, L. Sheng, Y . Qiao, and J. Shao, “Assessment of multimodal large language models in alignment with human values,” arXiv preprint arXiv:2403.17830 , 2024
-
[10]
Bold: Dataset and metrics for measuring biases in open-ended language generation,
J. Dhamala, T. Sun, V . Kumar, S. Krishna, Y . Pruksachatkun, K.-W. Chang, and R. Gupta, “Bold: Dataset and metrics for measuring biases in open-ended language generation,” in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (ACM FAccT) , 2021, pp. 862–872
work page 2021
-
[11]
Bbq: A hand-built bias benchmark for question answering,
A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, “Bbq: A hand-built bias benchmark for question answering,” in Findings of the Association for Computational Linguistics: ACL, 2022
work page 2022
-
[12]
K. Karkkainen and J. Joo, “Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation,” in Proceed- ings of the IEEE/CVF winter conference on applications of computer vision (WACV), 2021, pp. 1548–1558
work page 2021
-
[13]
Age progression/regression by con- ditional adversarial autoencoder,
Z. Zhang, Y . Song, and H. Qi, “Age progression/regression by con- ditional adversarial autoencoder,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) , 2017, pp. 5810–5818
work page 2017
-
[14]
Microsoft coco: Common objects in context,
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755
work page 2014
-
[15]
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hocken- maier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE/CVF international conference on computer vision (CVPR) , 2015, pp. 2641–2649
work page 2015
-
[16]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning (ICML). PMLR, 2021, pp. 8748–8763
work page 2021
-
[17]
Measuring and mitigating bias in vision-and- language models,
F. Chen and Z.-Y . Dou, “Measuring and mitigating bias in vision-and- language models,” Information Processing & Management , 2024
work page 2024
-
[18]
Leveraging clip for inferring sen- sitive information and improving model fairness,
M. Zhang and R. Chunara, “Leveraging clip for inferring sen- sitive information and improving model fairness,” arXiv preprint arXiv:2403.10624, 2024
-
[19]
Mitigating test-time bias for fair image retrieval,
F. Kong, S. Yuan, W. Hao, and R. Henao, “Mitigating test-time bias for fair image retrieval,” Advances in Neural Information Processing Systems (NIPS), vol. 36, 2024
work page 2024
-
[20]
Dear: Debiasing vision-language models with additive residuals,
A. Seth, M. Hemani, and C. Agarwal, “Dear: Debiasing vision-language models with additive residuals,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6820–6829
work page 2023
-
[21]
S. Janghorbani and G. De Melo, “Multimodal bias: Introducing a framework for stereotypical bias assessment beyond gender and race in vision language models,” arXiv preprint arXiv:2303.12734 , 2023
-
[22]
Vlstereoset: A study of stereotypical bias in pre-trained vision-language models,
K. Zhou, Y . LAI, and J. Jiang, “Vlstereoset: A study of stereotypical bias in pre-trained vision-language models,” in Annual Meeting of the Association for Computational Linguistics (ACL) , 2022
work page 2022
-
[23]
Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution,
S. M. Hall, F. Gonc ¸alves Abrantes, H. Zhu, G. Sodunke, A. Shtedritski, and H. R. Kirk, “Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution,” Advances in Neural Information Processing Systems (NIPS) , vol. 36, 2024
work page 2024
-
[24]
P. Howard, A. Madasu, T. Le, G. L. Moreno, A. Bhiwandiwalla, and V . Lal, “Socialcounterfactuals: Probing and mitigating intersectional social biases in vision-language models with counterfactual examples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 11 975–11 985
work page 2024
-
[25]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 , 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[26]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2024, pp. 26 296– 26 306. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13
work page 2024
-
[27]
Gemini: A Family of Highly Capable Multimodal Models
G. Team, R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Vision-language models for vision tasks: A survey,
J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2024
work page 2024
-
[30]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” in International conference on machine learning (ICML) . PMLR, 2023, pp. 19 730–19 742
work page 2023
-
[31]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
J. Chen, D. Zhu, X. Shen, X. Li, Z. Liu, P. Zhang, R. Krishnamoorthi, V . Chandra, Y . Xiong, and M. Elhoseiny, “Minigpt-v2: large language model as a unified interface for vision-language multi-task learning,” arXiv preprint arXiv:2310.09478 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Mimic-it: Multi-modal in-context instruction tuning,
B. Li, Y . Zhang, L. Chen, J. Wang, F. Pu, J. Yang, C. Li, and Z. Liu, “Mimic-it: Multi-modal in-context instruction tuning,” arXiv preprint arXiv:2306.05425, 2023
-
[33]
Instructblip: Towards general-purpose vision- language models with instruction tuning,
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision- language models with instruction tuning,” Advances in Neural Informa- tion Processing Systems (NIPS) , vol. 36, 2024
work page 2024
-
[34]
P. Zhang, X. D. B. Wang, Y . Cao, C. Xu, L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan, H. Yan et al. , “Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition,” arXiv preprint arXiv:2309.15112 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Deep learning face attributes in the wild,
Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE international conference on computer vision (ICCV) , 2015, pp. 3730–3738
work page 2015
-
[36]
W. A. G. Rojas, S. Diamos, K. R. Kini, D. Kanter, V . J. Reddi, and C. Coleman, “The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NIPS), 2022
work page 2022
-
[37]
Unequal representation and gender stereotypes in image search results for occupations,
M. Kay, C. Matuszek, and S. A. Munson, “Unequal representation and gender stereotypes in image search results for occupations,” in Proceedings of the 33rd annual acm conference on human factors in computing systems, 2015, pp. 3819–3828
work page 2015
-
[38]
Implicit diversity in image summarization,
L. E. Celis and V . Keswani, “Implicit diversity in image summarization,” Proceedings of the ACM on Human-Computer Interaction (HCI) , vol. 4, no. CSCW2, pp. 1–28, 2020
work page 2020
-
[39]
Fairclip: Harnessing fairness in vision-language learning,
Y . Luo, M. Shi, M. O. Khan, M. M. Afzal, H. Huang, S. Yuan, Y . Tian, L. Song, A. Kouhana, T. Elze et al. , “Fairclip: Harnessing fairness in vision-language learning,” arXiv preprint arXiv:2403.19949 , 2024
-
[40]
Are We on the Right Way for Evaluating Large Vision-Language Models?
L. Chen, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, J. Wang, Y . Qiao, D. Lin et al. , “Are we on the right way for evaluating large vision-language models?” arXiv preprint arXiv:2403.20330 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Improving image generation with better captions,
J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with better captions,” Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, vol. 2, no. 3, p. 8, 2023
work page 2023
-
[42]
A systematic survey on deep generative models for graph generation,
X. Guo and L. Zhao, “A systematic survey on deep generative models for graph generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 45, no. 5, pp. 5370–5390, 2022
work page 2022
-
[43]
Fast graph generation via spectral diffu- sion,
T. Luo, Z. Mo, and S. J. Pan, “Fast graph generation via spectral diffu- sion,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2023
work page 2023
-
[44]
Local and global gans with semantic-aware upsampling for image generation,
H. Tang, L. Shao, P. H. Torr, and N. Sebe, “Local and global gans with semantic-aware upsampling for image generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) , vol. 45, no. 1, pp. 768–784, 2022
work page 2022
-
[45]
Reinforcing generated images via meta-learning for one-shot fine-grained visual recognition,
S. Tsutsui, Y . Fu, and D. Crandall, “Reinforcing generated images via meta-learning for one-shot fine-grained visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) , vol. 46, no. 3, pp. 1455–1463, 2022
work page 2022
-
[46]
Fake it till you make it: Learning transferable representations from synthetic imagenet clones,
M. B. Sarıyıldız, K. Alahari, D. Larlus, and Y . Kalantidis, “Fake it till you make it: Learning transferable representations from synthetic imagenet clones,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2023, pp. 8011– 8021
work page 2023
-
[47]
Imagenet: A large-scale hierarchical image database,
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition (CVPR) . Ieee, 2009, pp. 248– 255
work page 2009
-
[48]
Stereoset: Measuring stereotypical bias in pretrained language models,
M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotypical bias in pretrained language models,” arXiv preprint arXiv:2004.09456, 2020
-
[49]
A unified framework and dataset for assessing gender bias in vision-language models,
A. Sathe, P. Jain, and S. Sitaram, “A unified framework and dataset for assessing gender bias in vision-language models,” arXiv preprint arXiv:2402.13636, 2024
-
[50]
Seed-bench: Benchmarking multimodal llms with generative comprehension,
B. Li, R. Wang, G. Wang, Y . Ge, Y . Ge, and Y . Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[51]
Mmbench: Is your multi-modal model an all- around player?
Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liu et al. , “Mmbench: Is your multi-modal model an all- around player?” in European Conference on Computer Vision (ECCV) . Springer, 2025, pp. 216–233
work page 2025
-
[52]
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models
P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, F. Meng, S. Huang, Y . Qiao, and P. Luo, “Lvlm-ehub: A comprehensive eval- uation benchmark for large vision-language models,” arXiv preprint arXiv:2306.09265, 2023
-
[53]
Fine-grained sentiment classification using bert,
M. Munikar, S. Shakya, and A. Shrestha, “Fine-grained sentiment classification using bert,” in2019 Artificial Intelligence for Transforming Business and Society (AITB) , vol. 1. IEEE, 2019, pp. 1–5
work page 2019
-
[54]
Vader: A parsimonious rule-based model for sentiment analysis of social media text,
C. Hutto and E. Gilbert, “Vader: A parsimonious rule-based model for sentiment analysis of social media text,” in Proceedings of the international AAAI conference on web and social media (ICWSM) , vol. 8, no. 1, 2014, pp. 216–225
work page 2014
-
[55]
Red-teaming the stable diffusion safety filter,
J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tram `er, “Red-teaming the stable diffusion safety filter,”arXiv preprint arXiv:2210.04610, 2022
-
[56]
Erasing concepts from diffusion models,
R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing concepts from diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (CVPR) , 2023, pp. 2426– 2436
work page 2023
-
[57]
Mace: Mass concept erasure in diffusion models,
S. Lu, Z. Wang, L. Li, Y . Liu, and A. W.-K. Kong, “Mace: Mass concept erasure in diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6430–6440
work page 2024
-
[58]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond,” arXiv preprint arXiv:2308.12966, vol. 1, no. 2, p. 3, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao, “Shikra: Unleashing multimodal llm’s referential dialogue magic,”arXiv preprint arXiv:2306.15195, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
The proof and measurement of association between two things,
C. Spearman, “The proof and measurement of association between two things,” The American journal of psychology, vol. 100, no. 3/4, pp. 441– 471, 1987
work page 1987
-
[61]
A new measure of rank correlation,
M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1-2, pp. 81–93, 1938
work page 1938
-
[62]
Leveraging large lan- guage models for multiple choice question answering,
J. Robinson, C. M. Rytting, and D. Wingate, “Leveraging large lan- guage models for multiple choice question answering,” in The Twelfth International Conference on Learning Representations (ICLR) , 2023
work page 2023
-
[63]
Large language models are not robust multiple choice selectors,
C. Zheng, H. Zhou, F. Meng, J. Zhou, and M. Huang, “Large language models are not robust multiple choice selectors,” in The Twelfth Inter- national Conference on Learning Representations (ICLR) , 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14 SUPPLEMENTARY MATERIAL I. D ISCUSSION A. Synthetic Dataset for training / fine-tuning In the f...
work page 2023
-
[64]
yes” (shown in Tab. VII as “yes pr
under BSD 3-Clause license, LLaV A [26] under Apache- 2.0 license, Otter [32] under MIT license, Shikra [59] under CC-BY-NC license, internlm-xcomposer [34] under Apache- 2.0 license, qwen [58] under Tongyi Qianwen license). For the diffusion model, SDXL [28] is under MIT License. C. Causes of Bias Based on our close-ended dataset, we investigate the mod-...
work page 2067
-
[65]
No, it is not suitable for the person in the picture to use the phone while driving. ------------------------------------------------------------ Fig. 7. Example for illusion-based bias. The model’s response includes information that never appears in the image and context. • Profession: Metalworking occupations, Sewing occupa- tions, Healthcare occupation...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.