Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-16 14:52 UTC · model grok-4.3
The pith
Vision-language models show demographic disparities in their decisions even when only facial race and gender attributes are changed in real photos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Demographic disparities persist in vision-language model outputs under strict visual control achieved by face-only counterfactuals from real photographs, and these disparities vary substantially across task formulations.
What carries the argument
The face-only counterfactual evaluation paradigm, which generates scene-matched variants by editing only facial attributes related to race and gender while keeping background, clothing, pose, and lighting fixed.
If this is right
- Demographic disparities remain detectable even when all non-face visual factors are held constant.
- The magnitude and direction of bias change substantially when the same underlying decision is framed as a forced choice versus a numeric salary recommendation (a minimal sketch of both readings follows this list).
- Controlled counterfactual image sets are required to separate demographic effects from correlated visual confounders.
- Task formulation must be treated as a variable when auditing social bias in multimodal models.
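To make the task-formulation point concrete, here is a minimal sketch, not the authors' code, of how the same counterfactual pairs yield two different disparity readings: a selection-rate gap for the forced-choice task and a signed mean gap for the salary task. The `records` structure and its values are hypothetical toy data.

```python
# Toy illustration: one set of counterfactual pairs, two disparity readings.
from statistics import mean

records = [
    # each entry: one scene, outputs for the original and the face-edited variant
    {"pair_id": 0, "choice_picked_edited": 1, "salary_orig": 92_000, "salary_edited": 83_000},
    {"pair_id": 1, "choice_picked_edited": 0, "salary_orig": 88_000, "salary_edited": 86_500},
    {"pair_id": 2, "choice_picked_edited": 0, "salary_orig": 95_000, "salary_edited": 90_000},
]

# Forced-choice disparity: deviation of the edited variant's selection rate from 50%.
choice_rate = mean(r["choice_picked_edited"] for r in records)
choice_gap = choice_rate - 0.5

# Salary disparity: mean signed gap between paired recommendations, in dollars.
salary_gap = mean(r["salary_edited"] - r["salary_orig"] for r in records)

print(f"forced-choice gap: {choice_gap:+.2f}")      # unitless rate deviation
print(f"mean salary gap: {salary_gap:+,.0f} USD")   # signed dollar shift
```

The two numbers need not agree in magnitude or even direction, which is exactly why task formulation has to be treated as an audit variable.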
Where Pith is reading between the lines
- The method could be extended to other editable attributes such as age or expression to test whether similar task-dependent patterns appear.
- Models may be relying on subtle facial cues as proxies for unobservable traits like competence when making occupational judgments.
- Audits that ignore task wording risk under- or over-estimating bias in downstream applications.
Load-bearing premise
Editing only facial attributes related to race and gender in real photographs preserves visual realism and does not introduce artifacts that themselves influence model outputs.
What would settle it
If the same five models produced statistically indistinguishable outputs on paired counterfactual images that differ solely in the edited facial race or gender attributes, the claim of persistent demographic disparities under visual control would be falsified.
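One way to operationalize "statistically indistinguishable" is a paired permutation test on the per-pair gaps. The sketch below is illustrative rather than the paper's procedure; the gap values are invented.

```python
# Paired sign-flip permutation test: under the null hypothesis that the
# face edit has no effect, each pair's signed gap is equally likely to
# carry either sign.
import random

def paired_permutation_test(gaps, n_resamples=10_000, seed=0):
    """Two-sided p-value for mean(gaps) != 0 under random sign flips."""
    rng = random.Random(seed)
    observed = abs(sum(gaps) / len(gaps))
    hits = 0
    for _ in range(n_resamples):
        flipped = [g if rng.random() < 0.5 else -g for g in gaps]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_resamples

# gaps[i] = salary_edited[i] - salary_orig[i] for counterfactual pair i (toy values)
gaps = [-9000, -1500, -5000, 2000, -3000, -4500]
print(paired_permutation_test(gaps))  # small p-value: disparity unlikely under the null
```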
Original abstract
Vision-Language Models (VLMs) are increasingly deployed in socially consequential settings, raising concerns about social bias driven by demographic cues. A central challenge in measuring such social bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving real-image realism. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed. Based on this paradigm, we construct FOCUS, a dataset of 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose REFLECT, a benchmark comprising three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs reveal that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underscore the necessity of controlled, counterfactual audits and highlight task design as a critical factor in evaluating social bias in multimodal models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a face-only counterfactual evaluation paradigm for isolating demographic effects (race and gender) in VLMs by editing only facial attributes in real photographs while holding all other visual factors fixed. It introduces the FOCUS dataset (480 scene-matched counterfactual images across six occupations and ten demographic groups) and the REFLECT benchmark (three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation). Experiments on five state-of-the-art VLMs show that demographic disparities persist under this visual control and vary substantially across task formulations.
Significance. If the counterfactual edits are shown to be artifact-free, the work provides a useful controlled method for attributing bias to demographic cues rather than correlated visual confounders, advancing beyond standard real-image evaluations. The observation that bias magnitude depends on task formulation is a substantive finding that could inform more robust auditing protocols for multimodal models.
major comments (1)
- [Abstract and §3] Abstract and §3 (FOCUS construction): the central claim that 'demographic disparities persist under strict visual control' is load-bearing on the assumption that editing only race/gender facial attributes produces no systematic artifacts (e.g., skin texture, lighting, or identity leakage) that VLMs could exploit. No perceptual similarity metrics, human realism ratings, or control experiments demonstrating unchanged non-demographic inferences are reported, leaving attribution to demographic signals unverified.
minor comments (2)
- [§4] §4 (REFLECT tasks): the prompt templates and exact wording for the three decision tasks are described at a high level; providing the full prompt text used for each VLM would improve reproducibility.
- [Table 1 and Figure 2] Table 1 and Figure 2: axis labels and legend entries use inconsistent demographic group abbreviations; a single consistent notation table would aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights an important aspect of validating our counterfactual editing approach. We address the major comment below and commit to revisions that will strengthen the manuscript.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (FOCUS construction): the central claim that 'demographic disparities persist under strict visual control' is load-bearing on the assumption that editing only race/gender facial attributes produces no systematic artifacts (e.g., skin texture, lighting, or identity leakage) that VLMs could exploit. No perceptual similarity metrics, human realism ratings, or control experiments demonstrating unchanged non-demographic inferences are reported, leaving attribution to demographic signals unverified.
Authors: We agree that explicit validation of edit quality is necessary to support attribution of disparities solely to demographic cues. The FOCUS images were generated via a targeted face-editing pipeline designed to modify only race- and gender-related facial attributes while preserving background, lighting, pose, clothing, and scene context. Nevertheless, the original submission did not include quantitative perceptual metrics or human validation. In the revised manuscript we will add: (1) LPIPS and SSIM scores between each original-counterfactual pair, computed both globally and on non-face regions to quantify unintended changes; (2) results from a human study in which raters evaluate photorealism and perceived identity consistency of the edited images; and (3) control experiments in which the same VLMs perform non-demographic tasks (e.g., object recognition or scene-attribute inference) on the counterfactual pairs to verify that non-demographic inferences remain stable. These additions will directly address the concern and provide stronger evidence that observed biases arise from demographic signals rather than editing artifacts.
Revision: yes
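For concreteness, here is a minimal sketch of the promised perceptual checks, one plausible implementation rather than the authors' pipeline. It assumes the `lpips` and `scikit-image` packages plus PyTorch; the `edit_quality` helper is hypothetical, and the face mask would come from a detector (e.g., a MediaPipe face mesh) and is simply passed in here.

```python
# Edit-quality checks for an original/counterfactual image pair:
# global LPIPS, global SSIM, and SSIM restricted to non-face regions.
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

loss_fn = lpips.LPIPS(net="alex")  # perceptual distance, lower = more similar

def to_lpips_tensor(img_uint8):
    """HWC uint8 in [0, 255] -> NCHW float in [-1, 1], as LPIPS expects."""
    t = torch.from_numpy(img_uint8).float().permute(2, 0, 1).unsqueeze(0)
    return t / 127.5 - 1.0

def edit_quality(orig, edited, face_mask):
    """orig/edited: HWC uint8 arrays; face_mask: HW bool, True on the face."""
    global_lpips = loss_fn(to_lpips_tensor(orig), to_lpips_tensor(edited)).item()
    global_ssim = ssim(orig, edited, channel_axis=-1, data_range=255)
    # Non-face check: zero out the face region in both images so any
    # remaining difference reflects unintended background/clothing changes.
    bg = ~face_mask[..., None]
    nonface_ssim = ssim(orig * bg, edited * bg, channel_axis=-1, data_range=255)
    return global_lpips, global_ssim, nonface_ssim
```

A near-perfect non-face SSIM alongside a nonzero global LPIPS is the signature one would want: the edit changed the face and essentially nothing else.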
Circularity Check
No circularity: purely empirical dataset construction and model evaluation
full rationale
The paper constructs the FOCUS dataset of counterfactual images by editing real photographs and evaluates five VLMs on the REFLECT benchmark tasks. No mathematical derivations, parameter fitting, or predictions are presented that reduce to the inputs by construction. The central claims derive from direct measurement of model outputs on the new images rather than from any self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The work is self-contained empirical auditing with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Face editing can isolate race and gender attributes without affecting other visual factors or introducing artifacts that influence model responses.
Lean theorems connected to this paper
- Files: IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/RealityFromDistinction.lean
- Theorems: reality_from_one_distinction; washburn_uniqueness_aczel
- Tag: unclear (the relation between the paper passage and the cited Recognition theorem is ambiguous)
- Linked passage: "We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving real-image realism... construct FOCUS, a dataset of 480 scene-matched counterfactual images... REFLECT... three decision-oriented tasks"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.