SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals

Hongyuan Zhu; Huaiyuan Qin; Muli Yang; Zihang Lin

arxiv: 2605.21919 · v1 · pith:7V7UHX5Onew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

Pith reviewed 2026-05-22 07:25 UTC · model grok-4.3

keywords: vision-language models, bias evaluation, sustainable development goals, benchmark suite, debiasing method, multi-modal reasoning, regression tasks

The pith

Vision-language models often substitute SDG-specific priors for visual and contextual evidence when assessing sustainable development tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SDGBiasBench, a large collection of multiple-choice questions and regression tasks focused on the Sustainable Development Goals, to measure how vision-language models perform on real monitoring work. Tests show that models frequently lean on learned associations with particular goals instead of properly weighing the images and text provided. The authors introduce a training-free method called CADE that counters this by contrasting answers across modalities. If the findings hold, it would mean current models need explicit correction before they can be trusted for quantitative or qualitative SDG evaluations. This matters because biased outputs could distort progress tracking on global targets like poverty reduction or climate action.

Core claim

Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, CADE leverages modality-specific answer priors in a training-free, plug-and-play manner and yields significant gains, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs.

What carries the argument

SDGBiasBench, a benchmark with 500k expert-involved multiple-choice questions and 50k regression tasks that isolates reliance on SDG priors, paired with CADE, which ensembles contrastive modality-specific priors to reduce that reliance.
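
The abstract does not spell out CADE's scoring rule, so the following is only a minimal sketch of the general contrastive idea, not the authors' implementation. It assumes the model exposes one raw score per answer option for the full input and for a reduced input, and that subtracting a weighted reduced-view log-probability approximates "contrasting out" an answer prior. The function names, the view name `q_only`, and the weights in `alphas` are all hypothetical.

```python
import math


def log_softmax(scores):
    """Convert raw option scores into log-probabilities."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def contrastive_debias(full_scores, prior_scores_by_view, alphas):
    """Subtract weighted answer priors from the full-input log-probabilities.

    full_scores: raw option scores given image + context + question.
    prior_scores_by_view: dict mapping a view name to raw option scores
        obtained from a reduced input (view names here are assumptions).
    alphas: dict mapping a view name to a non-negative weight (assumed
        hyperparameters, not values from the paper).
    """
    adjusted = log_softmax(full_scores)
    for view, scores in prior_scores_by_view.items():
        prior = log_softmax(scores)
        weight = alphas.get(view, 0.0)
        adjusted = [a - weight * p for a, p in zip(adjusted, prior)]
    return adjusted


def pick_option(scores):
    """Return the index of the highest-scoring option."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy example: the question alone already pushes the model toward option 0.
full = [2.0, 1.8, 0.1]
priors = {"q_only": [2.5, 0.5, 0.2]}
debiased = contrastive_debias(full, priors, {"q_only": 1.0})
```

With a weight of 0 the ranking is unchanged; larger weights penalize options that the reduced view already favored, which is the behavior the summary above attributes to contrasting modality-specific priors.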

If this is right

Both qualitative judgments and quantitative estimations in SDG tasks can be improved simultaneously with the same debiasing step.
Training-free adjustments suffice to produce measurable lifts in accuracy and error reduction on this scale of benchmark.
Model outputs become more dependent on the actual image-text pair once modality-specific priors are contrasted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-substitution pattern could appear in VLMs applied to other specialized domains such as medical imaging or legal document analysis.
Extending the benchmark to include temporal sequences of images might reveal whether biases strengthen or weaken with additional context.

Load-bearing premise

The expert-involved questions and regression tasks accurately represent real-world SDG monitoring scenarios without introducing their own systematic biases.

What would settle it

If the same models achieve comparable accuracy on a version of the benchmark where SDG labels are randomly reassigned to images while keeping visual content fixed, the claim of intrinsic model priors would be undermined.
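
The control described above can be sketched as a small harness. This is an illustration under stated assumptions, not a procedure from the paper: each benchmark item is assumed to be a dict with hypothetical keys `image`, `sdg_label`, `question`, and `answer`, and `predict` stands in for any model wrapper with that three-argument signature.

```python
import random


def accuracy(predict, items):
    """Fraction of items where predict(image, sdg_label, question) equals the answer."""
    hits = sum(
        1
        for it in items
        if predict(it["image"], it["sdg_label"], it["question"]) == it["answer"]
    )
    return hits / len(items)


def reassign_sdg_labels(items, seed=0):
    """Return copies of the items with SDG labels permuted; images stay fixed."""
    rng = random.Random(seed)
    labels = [it["sdg_label"] for it in items]
    rng.shuffle(labels)
    return [dict(it, sdg_label=label) for it, label in zip(items, labels)]


def label_shuffle_gap(predict, items, seed=0):
    """Accuracy on the original items minus accuracy after label reassignment."""
    shuffled = reassign_sdg_labels(items, seed)
    return accuracy(predict, items) - accuracy(predict, shuffled)
```

A predictor that ignores `sdg_label` yields a gap of exactly zero, while one that keys on the label should show a gap once labels are permuted. In practice the gap would need to be averaged over several seeds before drawing conclusions.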

Figures

Figures reproduced from arXiv: 2605.21919 by Hongyuan Zhu, Huaiyuan Qin, Muli Yang, Zihang Lin.

**Figure 2.** Overview of SDGBiasBench. The three sustainability pillars are each illustrated with one qualitative judgment and one quantitative estimation example, showcasing the multi-modal SDG reasoning tasks used to probe SDG biases in VLMs. Adjacent text: … multiple-choice questions for qualitative judgments and regression tasks for quantitative estimation, respectively. Each task is paired with satellite imagery, structured cont…

**Figure 3.** Per-view MCQ Accuracy. Accuracy (%) for three VLMs under four evidence views (Q-only, CTX+Q, IMG+Q, Full). [Recovered plot labels: models LLaVA-v1.5, InstructBLIP, Qwen2.5-VL; axis "Proportion (%)", 0 to 100; groups Pillar 1, Pillar 2, Pillar 3; categories Optimistic, Conservative, Pessimistic.] [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Figure 4 shows that these priors are not only model-specific but also pillar-dependent, forming distinctive bias signatures across Health & Nutrition, Basic Services & Infrastructure, and Human Capital & Development. Concretely, each model exhibits a characteristic triplet of outcome distributions across the three pillars, revealing where it systematically leans pessimistic, anchors to the middle, or defaults opt…

**Figure 5.** Performance difference under different input views. Results of ∆Acc (Full − CTX+Q) show consistent modality imbalance. Adjacent text: QWEN2.5-VL-7B shows a pronounced pessimistic leaning on Pillar 1 and Pillar 2: for both pillars, the pessimistic category constitutes a large portion of outputs (roughly half or more), with the remainder largely optimistic and little reliance on the conservative option. The pattern …

**Figure 7.** Performance of Vision–Language Models on: (Upper) multiple-choice questions; (Bottom) regression questions. "Ours" refers to applying CADE to the same base model, where reductions are highlighted with green [PITH_FULL_IMAGE:figures/full_fig_p007_7.png]

**Figure 8.** Hyperparameter sensitivity study on LLAVA-V1.5 and INSTRUCTBLIP. Each subplot varies one specific hyperparameter while fixing the other three. MCQ accuracy is reported as VLM's performance [PITH_FULL_IMAGE:figures/full_fig_p008_8.png]
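
The per-view accuracies in Figure 3 and the ∆Acc (Full − CTX+Q) quantity in Figure 5 can be computed along the following lines. This is a hedged sketch: the record layout (one predicted option per view plus a gold `answer`) and the function names are assumptions made for illustration, and the toy records are invented, not benchmark data.

```python
VIEWS = ("Q-only", "CTX+Q", "IMG+Q", "Full")


def per_view_accuracy(records):
    """Accuracy (%) per evidence view.

    records: list of dicts holding the gold 'answer' and one predicted
    option under each view name in VIEWS (assumed layout).
    """
    totals = {view: 0 for view in VIEWS}
    for rec in records:
        for view in VIEWS:
            totals[view] += int(rec[view] == rec["answer"])
    return {view: 100.0 * totals[view] / len(records) for view in VIEWS}


def delta_acc(acc, minuend="Full", subtrahend="CTX+Q"):
    """Accuracy difference in percentage points, by default Full minus CTX+Q."""
    return acc[minuend] - acc[subtrahend]


# Invented toy records, for illustration only.
toy_records = [
    {"answer": "A", "Q-only": "A", "CTX+Q": "A", "IMG+Q": "B", "Full": "A"},
    {"answer": "B", "Q-only": "A", "CTX+Q": "B", "IMG+Q": "B", "Full": "B"},
    {"answer": "C", "Q-only": "A", "CTX+Q": "C", "IMG+Q": "A", "Full": "A"},
    {"answer": "A", "Q-only": "A", "CTX+Q": "B", "IMG+Q": "A", "Full": "B"},
]
toy_acc = per_view_accuracy(toy_records)
toy_gap = delta_acc(toy_acc)
```

A negative ∆Acc, as in the toy data, would mean that adding the image to context plus question lowers accuracy. The Q-only accuracy doubles as a language-only solvability check on whatever subset is evaluated.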

Original abstract

Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision--Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of more fair and reliable AI systems for sustainable development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SDGBiasBench and CADE target a real gap in VLM evaluation for SDG tasks, but the bias claims rest on benchmark details that are not yet fully checked.

The punchline is that SDGBiasBench flags a plausible issue with VLMs substituting SDG knowledge for visual reasoning, and CADE provides a training-free way to counter it, but the benchmark construction leaves room for the measured effects to be partly self-induced.

What the paper does well is identify a practical need. SDG monitoring mixes images, context, and numbers, and most VLM tests do not stress that combination. By building 500k MCQs and 50k regression items with expert input, they create a dataset that can check both accuracy and bias at volume. The reported improvements from CADE across models suggest the method has some generality, and the plug-and-play nature makes it easy to adopt.

The soft spots sit in the validation of the benchmark itself. The abstract mentions expert-involved questions but gives no breakdown on how options were chosen, whether images were necessary for correct answers, or any language-only baselines. If many items can be answered from the question text or common knowledge alone, then the gap between standard inference and the debiasing step might overstate the role of model priors. The regression tasks face a similar issue around how ground truth was set and whether visual cues were the deciding factor. Without those checks, the central claim about intrinsic bias rests on thinner ground than the scale suggests.

This paper is for people who evaluate or deploy VLMs in domains that affect public decisions, such as climate or development tracking. A reader looking for new test sets or quick debiasing ideas will find usable material here. The thinking is straightforward and the problem is well-motivated.

It deserves a serious referee. The topic has downstream stakes, and the size of the benchmark is a genuine addition even if the current evidence needs more controls to be fully convincing.

Referee Report

2 major / 2 minor

Summary. The paper introduces SDGBiasBench, a large-scale benchmark with 500k expert-involved multiple-choice questions and 50k regression tasks targeting SDG-oriented vision-language reasoning in VLMs. It claims current VLMs exhibit intrinsic SDG bias by substituting SDG-specific priors for multi-modal cues, and proposes the training-free CADE (Contrastive Adaptive Debias Ensemble) method, which reportedly improves multiple-choice accuracy by up to 25% and reduces regression MAE by up to 12 points across multiple VLMs.

Significance. If the benchmark construction demonstrably isolates model priors from visual evidence and the CADE gains are attributable to debiasing rather than benchmark artifacts, the work could support more reliable VLM use in SDG monitoring applications. The scale (500k/50k tasks) is a potential strength for comprehensive evaluation, but this hinges on rigorous validation of the tasks as proxies for real-world multi-modal reasoning.

major comments (2)

[Benchmark Construction] Benchmark construction section: The paper describes questions and tasks as 'expert-involved' but provides no quantitative validation (e.g., language-only solvability rates on a held-out subset, inter-annotator agreement metrics, or checks for answer-option priors that encode SDG knowledge). Without these, it is unclear whether observed performance gaps reflect VLM substitution of priors for multi-modal cues or artifacts of question phrasing/image selection, directly undermining the central claim of 'intrinsic SDG bias'.
[Experiments] Evaluation and CADE results section: The reported improvements (up to 25% MC accuracy, 12-point MAE reduction) are given without statistical significance testing, variance across runs, or ablation against non-debiasing baselines (e.g., simple prompt engineering or modality weighting). This makes it difficult to confirm that gains stem specifically from leveraging 'modality-specific answer priors' as intended by CADE rather than other factors.
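
One way to supply the variance and significance information requested in the second comment is a paired bootstrap over benchmark items. The sketch below is an assumption about how such a check could look, not something the paper reports: it takes per-item 0/1 correctness flags for the base model and for the same model with a debiasing step, and returns the observed accuracy gain with a percentile confidence interval. Function and parameter names are hypothetical.

```python
import random


def paired_bootstrap_ci(base_correct, debiased_correct,
                        n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the accuracy gain (debiased minus base).

    base_correct / debiased_correct: equal-length lists of 0/1 flags,
    aligned so that index i refers to the same benchmark item in both.
    Returns (observed_gain, ci_low, ci_high) as fractions, not percent.
    """
    if len(base_correct) != len(debiased_correct):
        raise ValueError("inputs must be paired item by item")
    n = len(base_correct)
    diffs = [d - b for b, d in zip(base_correct, debiased_correct)]
    rng = random.Random(seed)
    gains = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gains.append(sum(sample) / n)
    gains.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 + level) / 2 * n_resamples))
    return sum(diffs) / n, gains[lo_idx], gains[hi_idx]
```

If the interval excludes zero, the gain is unlikely to be resampling noise on that item set. It says nothing about whether the gain comes from debiasing rather than from a simpler baseline, which is what the requested ablations would address.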

minor comments (2)

[Abstract] The abstract and introduction could more explicitly define 'SDG specific priors' with an example from the benchmark to aid reader understanding.
[Results] Table or figure captions for the main results should include exact VLM names, dataset splits, and confidence intervals to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify areas where the manuscript can be strengthened. We address each major comment below and will revise the paper accordingly to improve the presentation of benchmark validation and experimental results.

Referee: [Benchmark Construction] Benchmark construction section: The paper describes questions and tasks as 'expert-involved' but provides no quantitative validation (e.g., language-only solvability rates on a held-out subset, inter-annotator agreement metrics, or checks for answer-option priors that encode SDG knowledge). Without these, it is unclear whether observed performance gaps reflect VLM substitution of priors for multi-modal cues or artifacts of question phrasing/image selection, directly undermining the central claim of 'intrinsic SDG bias'.

Authors: We agree that the manuscript would benefit from explicit quantitative validation metrics to support the benchmark's ability to isolate model priors from visual evidence. The current description notes expert involvement in constructing the 500k multiple-choice questions and 50k regression tasks but does not report the requested metrics. In the revised version, we will expand the benchmark construction section to include language-only solvability rates on a held-out subset, inter-annotator agreement statistics, and analyses of answer-option priors. These additions will directly address whether performance gaps arise from intrinsic SDG bias or from question/image artifacts.
Revision: yes
Referee: [Experiments] Evaluation and CADE results section: The reported improvements (up to 25% MC accuracy, 12-point MAE reduction) are given without statistical significance testing, variance across runs, or ablation against non-debiasing baselines (e.g., simple prompt engineering or modality weighting). This makes it difficult to confirm that gains stem specifically from leveraging 'modality-specific answer priors' as intended by CADE rather than other factors.

Authors: We concur that including statistical tests, run variance, and targeted ablations would strengthen the attribution of CADE's gains to its contrastive debiasing mechanism. The manuscript reports the accuracy and MAE improvements across VLMs but does not present these supporting analyses. We will revise the evaluation and CADE results section to add statistical significance testing, standard deviations or variance across runs, and ablations against non-debiasing baselines such as prompt engineering and modality weighting. This will help confirm that the reported gains arise specifically from leveraging modality-specific answer priors.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and plug-and-play mitigation method

The paper introduces SDGBiasBench as a new dataset of expert-involved MCQs and regression tasks, then reports VLM performance and proposes the training-free CADE ensemble. No equations, fitted parameters, or derivations are present. The central claims rest on direct evaluation results rather than any self-referential reduction, self-citation chain, or renaming of prior results. The benchmark and method are self-contained against external model testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; the central claims rest on unstated assumptions about benchmark fidelity and model behavior that cannot be fully audited without the full manuscript.

axioms (1)

Domain assumption: Expert-involved questions accurately capture multi-step SDG reasoning without introducing confounding biases.
Invoked implicitly to support the claim that observed model errors reflect intrinsic VLM biases rather than benchmark artifacts.

invented entities (1)

CADE (Contrastive Adaptive Debias Ensemble): no independent evidence
Purpose: Training-free mitigation of SDG priors in VLMs via modality-specific answer adjustments.
New method introduced to address the identified bias.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG specific priors rather than reliable multi-modal cues."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "CADE ... leverages modality-specific answer priors ... improving multiple-choice accuracy by up to 25%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 7 internal anchors

[1]

Advances in neural information processing systems , volume =

Flamingo: a visual language model for few-shot learning , author =. Advances in neural information processing systems , volume =

work page
[2]

2024 , url =

Claude 3.5 Sonnet Model Card , author =. 2024 , url =

work page 2024
[3]

Proceedings of the IEEE international conference on computer vision , pages =

Vqa: Visual question answering , author =. Proceedings of the IEEE international conference on computer vision , pages =

work page
[4]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2025
[5]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author =. arXiv preprint arXiv:2502.13923 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages =

On the dangers of stochastic parrots: Can language models be too big? , author =. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages =

work page 2021
[7]

arXiv preprint arXiv:2005.14050 , year =

Language (technology) is power: A critical survey of" bias" in nlp , author =. arXiv preprint arXiv:2005.14050 , year =

work page arXiv 2005
[8]

On the Opportunities and Risks of Foundation Models

On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2312.10114 , year =

FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models , author =. arXiv preprint arXiv:2312.10114 , year =

work page arXiv
[10]

Science , volume =

Using satellite imagery to understand and promote sustainable development , author =. Science , volume =. 2021 , publisher =

work page 2021
[11]

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Pali-x: On scaling up a multilingual vision and language model , author =. arXiv preprint arXiv:2305.18565 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author =. arXiv preprint arXiv:2412.05271 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[13]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models , author =. arXiv preprint arXiv:2504.10479 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[14]

The Twelfth International Conference on Learning Representations , year =

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models , author =. The Twelfth International Conference on Learning Representations , year =

work page
[15]

International journal of epidemiology , volume =

Demographic and health surveys: a profile , author =. International journal of epidemiology , volume =. 2012 , publisher =

work page 2012
[16]

Advances in neural information processing systems , volume =

Instructblip: Towards general-purpose vision-language models with instruction tuning , author =. Advances in neural information processing systems , volume =

work page
[17]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

Geobench-vlm: Benchmarking vision-language models for geospatial tasks , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

work page
[18]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year =

Mme: A comprehensive evaluation benchmark for multimodal large language models , author =. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year =

work page
[19]

2024 , url =

Gemini 2.0: Unlocking New Capabilities in Multimodal AI , author =. 2024 , url =

work page 2024
[20]

Advances in neural information processing systems , volume=

Equality of opportunity in supervised learning , author=. Advances in neural information processing systems , volume=

work page
[21]

Proceedings of the Ninth International Conference on Information and Communication Technologies and Development , pages =

Can human development be measured with satellite imagery? , author =. Proceedings of the Ninth International Conference on Information and Communication Technologies and Development , pages =

work page
[22]

American economic review , volume=

Measuring economic growth from outer space , author=. American economic review , volume=. 2012 , publisher=

work page 2012
[23]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

Ai sees your location—but with a bias toward the wealthy world , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2025
[24]

arXiv preprint arXiv:2503.07575 , year =

VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models , author =. arXiv preprint arXiv:2503.07575 , year =

work page arXiv
[25]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =

Gqa: A new dataset for real-world visual reasoning and compositional question answering , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =

work page
[26]

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages =

Multi-modal bias: Introducing a framework for stereotypical bias assessment beyond gender and race in vision--language models , author =. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages =

work page
[27]

Science , volume =

Combining satellite imagery and machine learning to predict poverty , author =. Science , volume =. 2016 , doi =

work page 2016
[28]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Geochat: Grounded large vision-language model for remote sensing , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

work page
[29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Mitigating object hallucinations in large vision-language models through visual contrastive decoding , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

work page
[30]

International conference on machine learning , pages =

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author =. International conference on machine learning , pages =. 2023 , organization =

work page 2023
[31]

Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) , pages =

Contrastive decoding: Open-ended text generation as optimization , author =. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) , pages =

work page
[32]

Evaluating Object Hallucination in Large Vision-Language Models

Evaluating object hallucination in large vision-language models , author =. arXiv preprint arXiv:2305.10355 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[33]

LLaVA-OneVision: Easy Visual Task Transfer

Llava-onevision: Easy visual task transfer , author =. arXiv preprint arXiv:2408.03326 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[34]

arXiv preprint arXiv:2306.01879 , year =

Revisiting the role of language priors in vision-language models , author =. arXiv preprint arXiv:2306.01879 , year =

work page arXiv
[35]

Advances in neural information processing systems , volume =

Visual instruction tuning , author =. Advances in neural information processing systems , volume =

work page
[36]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =

Improved baselines with visual instruction tuning , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =

work page
[37]

Advances in Neural Information Processing Systems , volume =

Learn to explain: Multimodal reasoning via thought chains for science question answering , author =. Advances in Neural Information Processing Systems , volume =

work page
[38]

arXiv:2402.02680, 2024

Large language models are geographically biased , author =. arXiv preprint arXiv:2402.02680 , year =

work page arXiv
[39]

Pangaea: A global and inclusive benchmark for geospatial foundation models.arXiv preprint arXiv:2412.04204, 2024

Pangaea: A global and inclusive benchmark for geospatial foundation models , author =. arXiv preprint arXiv:2412.04204 , year =

work page arXiv
[40]

Proceedings of the conference on fairness, accountability, and transparency , pages =

Model cards for model reporting , author =. Proceedings of the conference on fairness, accountability, and transparency , pages =

work page
[41]

StereoSet: Measuring stereotypical bias in pretrained language models , author =. Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers) , pages =

work page
[42]

Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages =

CrowS-pairs: A challenge dataset for measuring social biases in masked language models , author =. Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages =

work page 2020
[43]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =

Counterfactual vqa: A cause-effect look at language bias , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =

work page
[44]

2024 , url =

Hello GPT-4o , author =. 2024 , url =

work page 2024
[45]

Advances in Neural Information Processing Systems , volume =

No filter: Cultural and socioeconomic diversity in contrastive vision-language models , author =. Advances in Neural Information Processing Systems , volume =

work page
[46]

International conference on machine learning , pages =

Learning transferable visual models from natural language supervision , author =. International conference on machine learning , pages =. 2021 , organization =

work page 2021
[47]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

Debias your large multi-modal model at test-time with non-contrastive visual attribute steering , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

work page
[48]

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages =

Measuring social biases in grounded vision and language embeddings , author =. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages =

work page 2021
[49]

Findings of the Association for Computational Linguistics: ACL 2023 , year =

A multi-dimensional study on bias in vision-language models , author =. Findings of the Association for Computational Linguistics: ACL 2023 , year =

work page 2023
[50]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages =

Earthdial: Turning multi-sensory earth observations to interactive dialogues , author =. Proceedings of the Computer Vision and Pattern Recognition Conference , pages =

work page
[51]

Transforming our world: the 2030 Agenda for Sustainable Development , year =

work page 2030
[52]

Mitigating hallucinations in large vision-language models with instruction contrastive decoding

Mitigating hallucinations in large vision-language models with instruction contrastive decoding , author =. arXiv preprint arXiv:2403.18715 , year =

work page arXiv
[53]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =

Images speak louder than words: Understanding and mitigating bias in vision-language model from a causal mediation perspective , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =

work page 2024
[54]

arXiv preprint arXiv:2407.02814 , year =

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective , author =. arXiv preprint arXiv:2407.02814 , year =

work page arXiv
[55]

Nature communications , volume =

Using publicly available satellite imagery and deep learning to understand economic well-being in Africa , author =. Nature communications , volume =. 2020 , publisher =

work page 2020
[56]

Sustainbench: Bench- marks for monitoring the sustainable development goals with machine learning

Sustainbench: Benchmarks for monitoring the sustainable development goals with machine learning , author =. arXiv preprint arXiv:2111.04724 , year =

work page arXiv
[57]

Proceedings of the 30th ACM International Conference on Multimedia , pages =

Counterfactually measuring and eliminating social bias in vision-language pre-training models , author =. Proceedings of the 30th ACM International Conference on Multimedia , pages =

work page
[58] Debiasing multimodal large language models via penalization of language priors. Proceedings of the 33rd ACM International Conference on Multimedia.
[59] Vlstereoset: A study of stereotypical bias in pre-trained vision-language models. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
[60] Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. Proceedings of the Computer Vision and Pattern Recognition Conference.
