pith. machine review for the scientific record.

arxiv: 2601.06931 · v2 · submitted 2026-01-11 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 1 theorem link · Lean Theorem

Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords social bias · vision-language models · counterfactual evaluation · demographic disparities · face editing · benchmark dataset
0 comments

The pith

Vision-language models show demographic disparities in decisions even when only facial race and gender are changed in real photos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a face-only counterfactual method to measure social bias in vision-language models by editing only the race and gender attributes of faces in real photographs while holding every other visual element fixed. This isolates whether models respond differently to demographic cues alone. The authors create the FOCUS dataset of 480 matched images spanning six occupations and ten demographic groups, then test five state-of-the-art models on the REFLECT benchmark of three decision tasks. Results indicate that demographic disparities remain detectable under this strict control and that the size and direction of those disparities shift markedly when the task formulation changes from forced choice to salary estimation.

Core claim

Demographic disparities persist in vision-language model outputs under strict visual control achieved by face-only counterfactuals from real photographs, and these disparities vary substantially across task formulations.

What carries the argument

The face-only counterfactual evaluation paradigm, which generates scene-matched variants by editing only facial attributes related to race and gender while keeping background, clothing, pose, and lighting fixed.
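
As a rough illustration, the generation loop implied by this paradigm (and by the Figure 7 prompt template) might look like the sketch below. The editing backend is not specified on this page, so `edit_face_only` and the instruction wording are hypothetical stand-ins, not the authors' pipeline.

```python
# Sketch of a FOCUS-style variant generator. Assumptions: an
# instruction-following image editor sits behind `edit_face_only`,
# and the instruction below paraphrases (not quotes) the Figure 7 template.
from itertools import product
from pathlib import Path

RACES = ["Asian", "Black", "Latino", "Middle Eastern", "White"]
GENDERS = ["Male", "Female"]

INSTRUCTION = (
    "Apply only subtle, face-localized edits so the person reads as "
    "{target_group}. Keep background, clothing, pose, lighting, and "
    "all other photographic properties unchanged."
)

def edit_face_only(source: Path, instruction: str) -> bytes:
    """Hypothetical call into whatever editing model the pipeline uses."""
    raise NotImplementedError

def generate_variants(source: Path, out_dir: Path) -> None:
    """Produce the 5 races x 2 genders = 10 counterfactuals per photo."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for race, gender in product(RACES, GENDERS):
        prompt = INSTRUCTION.format(target_group=f"{race} {gender}")
        edited = edit_face_only(source, prompt)
        (out_dir / f"{race.replace(' ', '')}_{gender}.png").write_bytes(edited)
```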

If this is right

  • Demographic disparities remain detectable even when all non-face visual factors are held constant.
  • The magnitude and direction of bias change substantially when the same underlying decision is framed as a forced choice versus a numeric salary recommendation.
  • Controlled counterfactual image sets are required to separate demographic effects from correlated visual confounders.
  • Task formulation must be treated as a variable when auditing social bias in multimodal models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to other editable attributes such as age or expression to test whether similar task-dependent patterns appear.
  • Models may be relying on subtle facial cues as proxies for unobservable traits like competence when making occupational judgments.
  • Audits that ignore task wording risk under- or over-estimating bias in downstream applications.

Load-bearing premise

Editing only facial attributes related to race and gender in real photographs preserves visual realism and does not introduce artifacts that themselves influence model outputs.

What would settle it

If the same five models produce statistically identical outputs on paired counterfactual images that differ solely in the edited facial race or gender attributes, the claim of persistent demographic disparities under visual control would be falsified.
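
One minimal version of that settling test, assuming each retained 2AFC trial is logged as a win or loss for one group of a pair: under the null of no demographic effect, the win probability is 0.5, which a simple binomial test can check per group pair. The paper may well use a different procedure; this is only a sketch.

```python
# Sketch: does a group pair's 2AFC win rate deviate from the 50%
# expected when demographic cues carry no signal?
from scipy.stats import binomtest

def null_survives(wins: int, total: int, alpha: float = 0.05) -> bool:
    """True if 'win prob = 0.5' is not rejected for this group pair."""
    return binomtest(wins, total, p=0.5).pvalue >= alpha

# Illustrative numbers only (not from the paper):
print(null_survives(91, 120))   # False -> a disparity is detected
print(null_survives(63, 120))   # True  -> consistent with no effect
```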

Figures

Figures reproduced from arXiv: 2601.06931 by Haodong Chen, Jiaqi Zhao, Jun Yu, Qiang Huang, Qiuping Jiang, Xiaojun Chang.

Figure 1
Figure 1: FOCUS isolates facial demographic cues while … view at source ↗
Figure 2
Figure 2: Overview of REFLECT with FOCUS dataset construction. Starting from real photos, we generate scene-matched counterfactuals by editing only facial demographic cues while keeping all other context fixed. Using these controlled images, we evaluate VLMs with three decision-oriented tasks: (1) 2AFC, head-to-head comparisons between paired counterfactuals from the same source photo; (2) MCQ, single-image categori… view at source ↗
Figure 3
Figure 3: FOCUS example from one source photo. Ten face-only counterfactual variants (5 races × 2 genders) generated from the same real source photo, illustrating the visual control used in REFLECT. view at source ↗
Figure 4
Figure 4: 2AFC results for Gemini-2.5-Pro on FOCUS. (a–c) Pairwise win-rate matrices over the 10 race–gender groups for Income, Education, and Perceived Safety; each cell shows the fraction of retained comparisons in which the row group is selected over the column group. Groups are abbreviated by race (A/B/L/ME/W) × gender (M/F). (d–e) Race win rates within male (d) and female (e) variants, reported separately for e… view at source ↗
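The caption fully determines the matrix computation; a sketch, under the assumption that each retained comparison is logged as a (group_a, group_b, winner) tuple:

```python
# Sketch of the Figure 4 win-rate matrix. Assumes a trial log of
# (group_a, group_b, winner) tuples for retained comparisons.
import numpy as np

GROUPS = ["AM", "AF", "BM", "BF", "LM", "LF", "MEM", "MEF", "WM", "WF"]
IDX = {g: i for i, g in enumerate(GROUPS)}

def win_rate_matrix(trials: list[tuple[str, str, str]]) -> np.ndarray:
    n = len(GROUPS)
    wins = np.zeros((n, n))
    totals = np.zeros((n, n))
    for a, b, winner in trials:
        i, j = IDX[a], IDX[b]
        totals[i, j] += 1
        totals[j, i] += 1
        loser = b if winner == a else a
        wins[IDX[winner], IDX[loser]] += 1
    with np.errstate(invalid="ignore"):
        return wins / totals  # cell (row, col): fraction row beats col
```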
Figure 5
Figure 5: MCQ results on FOCUS. Mean-based percentage gaps Δg relative to reference groups (White for race; Female for gender). view at source ↗
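The gap statistic in this caption reduces to a one-liner once per-group means are available; a sketch, with the exact definition of Δg assumed rather than taken from the paper:

```python
# Sketch of the mean-based percentage gap relative to a reference group
# (White for race, Female for gender, per the caption). The precise Δg
# formula is an assumption, not copied from the paper.
def percentage_gaps(group_means: dict[str, float], reference: str) -> dict[str, float]:
    ref = group_means[reference]
    return {g: 100.0 * (m - ref) / ref for g, m in group_means.items()}

# Illustrative values only:
gaps = percentage_gaps({"White": 72_000.0, "Black": 66_500.0}, "White")
# gaps["Black"] is about -7.6: a negative salary gap vs. the reference
```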
Figure 6
Figure 6: Salary recommendation on FOCUS. Mean percentage gaps by occupation (race vs. White; gender: Male vs. Female). view at source ↗
Figure 7
Figure 7: Face-only counterfactual editing prompt used in FOCUS. We apply the same instruction template to every source photo and vary only {TARGET_GROUP} (one of the ten race–gender groups). The prompt constrains scene context and photographic properties to remain unchanged while allowing only subtle, face-localized edits required by the target demographic description, and returns a single edited image per request. view at source ↗
Figure 8
Figure 8: 2AFC prompts used in REFLECT. We evaluate three scenarios (Income, Education, and Perceived Safety). In each case, the model is shown two counterfactual versions of the same FOCUS source photo, labeled A and B, and must answer with exactly one letter. view at source ↗
Figure 9
Figure 9: MCQ prompts used in REFLECT. The model is shown one FOCUS image and must output a single option letter for Salary (A–F) or Education (A–D), with no additional text. view at source ↗
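Since both MCQ prompts demand a bare option letter, scoring hinges on a strict parse of the model's reply. A sketch, with the drop-malformed policy assumed from the "retained comparisons" wording in Figure 4 rather than stated by the authors:

```python
# Sketch of strict answer parsing for REFLECT-style prompts. Assumption:
# malformed replies are simply dropped; the authors' actual policy may differ.
import re

def parse_choice(reply: str, valid: str = "ABCDEF") -> str | None:
    """Accept exactly one option letter (optionally with whitespace or
    a trailing period); return None to drop the trial otherwise."""
    m = re.fullmatch(rf"\s*([{valid}])\s*\.?\s*", reply)
    return m.group(1) if m else None

print(parse_choice("B"))        # 'B'
print(parse_choice(" C. "))     # 'C'
print(parse_choice("B or C"))   # None -> trial dropped
```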
Figure 10
Figure 10: Salary recommendation prompt used in REFLECT. The model is provided an occupation title and a candidate biography and must output a single integer annual salary in USD, with no additional text. view at source ↗
Figure 11
Figure 11: FOCUS examples for CEO. view at source ↗
Figure 12
Figure 12: FOCUS examples for nurse. view at source ↗
Figure 13
Figure 13: FOCUS examples for lawyer. view at source ↗
Figure 14
Figure 14: FOCUS examples for cook. view at source ↗
Figure 15
Figure 15: FOCUS examples for doctor. view at source ↗
Figure 16
Figure 16: 2AFC results for GPT-5 on FOCUS. Panels: (a) GPT Income; (b) GPT Education; (c) GPT Perceived safety; win rates reported in %. view at source ↗
Figure 17
Figure 17: 2AFC results for Llama3.2-90B-Vision-Instruct on FOCUS. view at source ↗
Figure 18
Figure 18: 2AFC results for Qwen3-VL-Plus on FOCUS. Panels: (a) Qwen Income; (b) Qwen Education; (c) Qwen Perceived safety. view at source ↗
read the original abstract

Vision-Language Models (VLMs) are increasingly deployed in socially consequential settings, raising concerns about social bias driven by demographic cues. A central challenge in measuring such social bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving real-image realism. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed. Based on this paradigm, we construct FOCUS, a dataset of 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose REFLECT, a benchmark comprising three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs reveal that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underscore the necessity of controlled, counterfactual audits and highlight task design as a critical factor in evaluating social bias in multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a face-only counterfactual evaluation paradigm for isolating demographic effects (race and gender) in VLMs by editing only facial attributes in real photographs while holding all other visual factors fixed. It introduces the FOCUS dataset (480 scene-matched counterfactual images across six occupations and ten demographic groups) and the REFLECT benchmark (three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation). Experiments on five state-of-the-art VLMs show that demographic disparities persist under this visual control and vary substantially across task formulations.

Significance. If the counterfactual edits are shown to be artifact-free, the work provides a useful controlled method for attributing bias to demographic cues rather than correlated visual confounders, advancing beyond standard real-image evaluations. The observation that bias magnitude depends on task formulation is a substantive finding that could inform more robust auditing protocols for multimodal models.

major comments (1)
  1. [Abstract and §3] Abstract and §3 (FOCUS construction): the central claim that 'demographic disparities persist under strict visual control' is load-bearing on the assumption that editing only race/gender facial attributes produces no systematic artifacts (e.g., skin texture, lighting, or identity leakage) that VLMs could exploit. No perceptual similarity metrics, human realism ratings, or control experiments demonstrating unchanged non-demographic inferences are reported, leaving attribution to demographic signals unverified.
minor comments (2)
  1. [§4] §4 (REFLECT tasks): the prompt templates and exact wording for the three decision tasks are described at a high level; providing the full prompt text used for each VLM would improve reproducibility.
  2. [Table 1 and Figure 2] Table 1 and Figure 2: axis labels and legend entries use inconsistent demographic group abbreviations; a single consistent notation table would aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important aspect of validating our counterfactual editing approach. We address the major comment below and commit to revisions that will strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (FOCUS construction): the central claim that 'demographic disparities persist under strict visual control' is load-bearing on the assumption that editing only race/gender facial attributes produces no systematic artifacts (e.g., skin texture, lighting, or identity leakage) that VLMs could exploit. No perceptual similarity metrics, human realism ratings, or control experiments demonstrating unchanged non-demographic inferences are reported, leaving attribution to demographic signals unverified.

    Authors: We agree that explicit validation of edit quality is necessary to support attribution of disparities solely to demographic cues. The FOCUS images were generated via a targeted face-editing pipeline designed to modify only race- and gender-related facial attributes while preserving background, lighting, pose, clothing, and scene context. Nevertheless, the original submission did not include quantitative perceptual metrics or human validation. In the revised manuscript we will add: (1) LPIPS and SSIM scores between each original-counterfactual pair, computed both globally and on non-face regions to quantify unintended changes; (2) results from a human study in which raters evaluate photorealism and perceived identity consistency of the edited images; and (3) control experiments in which the same VLMs perform non-demographic tasks (e.g., object recognition or scene-attribute inference) on the counterfactual pairs to verify that non-demographic inferences remain stable. These additions will directly address the concern and provide stronger evidence that observed biases arise from demographic signals rather than editing artifacts. revision: yes
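
For readers who want to anticipate what item (1) might look like in practice, here is a hedged sketch of non-face LPIPS/SSIM scoring. The face `mask` is assumed to come from some face detector (the reference list mentions MediaPipe); nothing here is the authors' actual validation code.

```python
# Sketch of the proposed edit-quality check: SSIM and LPIPS restricted
# to non-face pixels of an original/counterfactual pair. Producing the
# face mask (e.g., with a face detector) is out of scope for this sketch.
import lpips                      # pip install lpips
import numpy as np
import torch
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance; lower = closer

def _to_lpips_tensor(img: np.ndarray) -> torch.Tensor:
    # HWC uint8 -> 1x3xHxW float in [-1, 1], the range LPIPS expects
    t = torch.from_numpy(img).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def non_face_scores(orig: np.ndarray, edit: np.ndarray, mask: np.ndarray):
    """orig, edit: HxWx3 uint8; mask: HxW bool, True over the face."""
    keep = (~mask)[..., None]               # zero out the face region
    o = orig * keep
    e = edit * keep
    ssim = structural_similarity(o, e, channel_axis=-1)
    with torch.no_grad():
        lp = lpips_fn(_to_lpips_tensor(o), _to_lpips_tensor(e)).item()
    return ssim, lp                         # high SSIM / low LPIPS = clean edit
```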

Circularity Check

0 steps flagged

No circularity: purely empirical dataset construction and model evaluation

full rationale

The paper constructs the FOCUS dataset of counterfactual images by editing real photographs and evaluates five VLMs on the REFLECT benchmark tasks. No mathematical derivations, parameter fitting, or predictions are presented that reduce to the inputs by construction. The central claims derive from direct measurement of model outputs on the new images rather than from any self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The work is self-contained empirical auditing with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that face-only edits isolate demographic effects without introducing confounding visual artifacts; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Face editing can isolate race and gender attributes without affecting other visual factors or introducing artifacts that influence model responses.
    Invoked as the basis for the counterfactual paradigm in the abstract.

pith-pipeline@v0.9.0 · 5514 in / 1212 out tokens · 62511 ms · 2026-05-16T14:52:45.434884+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 5 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Qwen3-VL technical report. Preprint, arXiv:2511.21631. Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt

  2. [2]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261. Marta Costa-jussà, Pierre Andrews, Eric Smith, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Daniel Licht, and Carleigh Wood

  3. [3]

    Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale

    Multilingual holistic bias: Extending descriptors and patterns to unveil demographic biases in languages at scale. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14141–14156. Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnara...

  4. [4]

    BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation

    BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Amy Yang, Angela Fan, and others

  5. [5]

    The Llama 3 Herd of Models

    The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Kathleen C Fraser and Svetlana Kiritchenko

  6. [6]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369. Phillip Howard, Kathleen C Fraser, Anahita Bhiwandiwalla, and Svetlana Kiritchenko

  7. [7]

    Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals

    Uncovering bias in large vision-language models at scale with counterfactuals. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 5946–5991. Phillip Howard, Avinash Madasu, Tiep Le, Gustavo Lujan Moreno, Anahita Bhiwandiwalla, and Vasudev Lal

  8. [8]

    VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models

    VisBias: Measuring explicit and implicit social biases in vision language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 17981–18004. Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva

  9. [9]

    MediaPipe: A Framework for Building Perception Pipelines

    MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172. Moin Nadeem, Anna Bethke, and Siva Reddy

  10. [10]

    CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

    CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967. Huy Nghiem, John Prindle, Jieyu Zhao, and Hal Daumé III

  11. [11]

    “You Gotta be a Doctor, Lin”: An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

    “You Gotta be a Doctor, Lin”: An investigation of name-based bias of large language models in employment recommendations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7268–

  12. [12]

    GPT-5 System Card

    GPT-5 system card. https://cdn.openai.com/gpt-5-system-card.pdf. Accessed: 2026-01-02. Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman

  13. [13]

    BBQ: A Hand-Built Bias Benchmark for Question Answering

    BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and others

  14. [14]

    Unbiased Look at Dataset Bias

    Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, and others

  15. [15]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302. Yi Zhang, Junyang Wang, and Jitao Sang

  16. [16]

    Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-Level Constraints

    Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang

  17. [17]

    Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

    Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20. Kankan Zhou, Eason Lai, and Jing Jiang

  18. [18]

    “A” or “B”

    C.2 Significance Tests of Salary Recommendation. We complement the mean gap visualizations with regression-based, cluster-robust significance tests. While Figure 6 summarizes effect magnitude via mean absolute gaps, Table 3 tests for systematic signed shifts across demographic conditions at the unit level, using standard errors clustered by unit (defined by ...
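
The clustered test this fragment describes is standard enough to sketch. Assuming a long-format log with one row per salary recommendation and columns `salary`, `race`, `gender`, and `unit` (all names hypothetical, not the authors' artifacts), cluster-robust OLS in statsmodels would look like:

```python
# Sketch of a regression-based, cluster-robust significance test for
# signed salary shifts, in the spirit of the appendix C.2 description.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("salary_recommendations.csv")  # hypothetical trial log

# Dummies for race (reference: White) and gender (reference: Female);
# standard errors clustered by source-photo unit.
fit = smf.ols(
    "salary ~ C(race, Treatment('White')) + C(gender, Treatment('Female'))",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["unit"]})

print(fit.summary())  # coefficient signs = systematic shifts vs. reference
```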