
arxiv: 2503.07878 · v5 · submitted 2025-03-10 · 💻 cs.CV · cs.AI

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

Pith reviewed 2026-05-22 23:51 UTC · model grok-4.3

keywords bias amplification · image captioning · directional metrics · gender bias · race bias · COCO captions · LIC metric

The pith

A directional metric called DBAC identifies whether image captioning models amplify biases from the image or the text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Directional Bias Amplification in Captioning (DBAC) as a language-aware metric that tracks bias amplification while also showing its direction. Prior metrics either ignore caption semantics or, like LIC, cannot distinguish whether amplification originates in the visual input or the generated text. DBAC adds directionality, reduces sensitivity to sentence encoder choice, and yields more accurate estimates on attributes such as gender and race. Experiments on the COCO captions dataset indicate that only the directional formulation reliably surfaces the source of amplification in captioning models.

Core claim

DBAC measures bias amplification in captions in a directional manner, allowing the source of amplification (image versus text) to be identified, which non-directional language-aware metrics cannot do.

What carries the argument

Directional Bias Amplification in Captioning (DBAC), a metric that computes amplification scores while separating the contribution of the image from the contribution of the caption text.
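To make the idea of a directional score concrete, here is a minimal sketch. It is not the paper's DBAC definition, which does not appear in the text reviewed here. It assumes a hypothetical "predictability" number for each direction, measured once on human-written captions and once on model-generated captions, and reports a signed, normalized change per direction. The class and function names, the normalization, and every number are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Predictability:
    """Hypothetical predictability scores for one direction."""
    human: float  # measured on human-written captions
    model: float  # measured on model-generated captions


def normalized_change(p: Predictability) -> float:
    """Signed change from human to model captions.

    For non-negative inputs the result lies in [-1, 1].
    This normalization is an assumption made for illustration.
    """
    total = p.human + p.model
    if total == 0:
        return 0.0
    return (p.model - p.human) / total


def directional_scores(per_direction):
    """Score each direction separately instead of merging them."""
    return {name: normalized_change(p) for name, p in per_direction.items()}


# Made-up inputs: the labels follow this page's image-versus-text wording.
scores = directional_scores({
    "image_side": Predictability(human=0.60, model=0.75),
    "text_side": Predictability(human=0.60, model=0.58),
})

for name, value in scores.items():
    print(f"{name}: {value:+.3f}")
```

With these made-up inputs only the image-side score is positive, which is the kind of per-direction reading a single merged score would hide.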

If this is right

  • Captioning models can be diagnosed to determine whether they amplify image biases or text biases on attributes such as gender and race.
  • Targeted interventions can be applied to the identified source once the direction of amplification is known.
  • Bias estimates become more stable across different sentence encoders used to encode captions.
  • Evaluation of captioning systems on datasets like COCO becomes possible with a single reliable metric instead of multiple incomplete ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same directional approach could be tested on other multimodal generation tasks where source attribution matters.
  • If directionality proves necessary here, similar reformulations might improve bias measurement in related language-vision settings.
  • Developers could use DBAC scores to decide whether to clean image data, caption data, or both.

Load-bearing premise

Identifying the source of bias amplification requires a directional formulation rather than a non-directional language-aware metric.

What would settle it

Apply DBAC to a captioning model in which bias amplification has been artificially introduced only on the image side and check whether the metric attributes the excess bias to the image rather than the text.

Figures

Figures reproduced from arXiv: 2503.07878 by Bhanu Tokas, Hannah Kerner, Rahul Nair.

Figure 1. GloVe Thresholds: Percentage of GloVe substitutions vs. the selected threshold δ across different models. To select the optimal distance threshold δ for our experiments, we observed how many contextual substitutions occurred at different thresholds. We tested this using the GloVe word embedding model. If δ is large, we allow weaker word substitutions. Consider the HGC: “a <gender> is sleeping”. Assume th…
Original abstract

When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.
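For readers unfamiliar with leakage-style metrics such as LIC, the sketch below shows their general shape. An attacker classifier tries to predict the protected attribute from captions in which attribute words are masked, and the score on model captions is compared with the score on human captions. This follows our understanding of the LIC formulation (mean attacker confidence counted only on correct predictions, model score minus human score). The attacker outputs are made-up numbers, and `leakage` is a hypothetical helper, not code from either paper.

```python
def leakage(predictions, labels):
    """Mean attacker confidence, counted only where the guess is correct.

    predictions: list of dicts mapping attribute value -> probability
    labels: list of true attribute values, same length as predictions
    """
    total = 0.0
    for probs, true_attr in zip(predictions, labels):
        guess = max(probs, key=probs.get)
        if guess == true_attr:
            total += probs[true_attr]
    return total / len(labels)


labels = ["woman", "man", "woman", "man"]

# Hypothetical attacker outputs on masked human-written captions.
human_caption_probs = [
    {"woman": 0.55, "man": 0.45},
    {"woman": 0.52, "man": 0.48},  # wrong guess, contributes 0
    {"woman": 0.60, "man": 0.40},
    {"woman": 0.30, "man": 0.70},
]

# Hypothetical attacker outputs on masked model-generated captions.
model_caption_probs = [
    {"woman": 0.80, "man": 0.20},
    {"woman": 0.35, "man": 0.65},
    {"woman": 0.75, "man": 0.25},
    {"woman": 0.10, "man": 0.90},
]

lic_human = leakage(human_caption_probs, labels)
lic_model = leakage(model_caption_probs, labels)
lic = lic_model - lic_human  # positive means more leakage in model captions

print(f"human={lic_human:.4f} model={lic_model:.4f} difference={lic:+.4f}")
```

A single signed number of this kind says that leakage went up, but not on which side it arose, which is the limitation the abstract attributes to LIC.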

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Directional Bias Amplification in Captioning (DBAC), a language-aware directional metric for quantifying bias amplification in image captioning. It positions DBAC as addressing limitations of classification-only metrics (BA, DPA) and of LIC (language-aware but non-directional, hence unable to identify bias source). DBAC is claimed to be less sensitive to sentence-encoder choice and to yield more accurate estimates; experiments on gender and race attributes in COCO captions are used to conclude that DBAC is the only reliable metric for this task.

Significance. If the empirical claims hold, DBAC would supply a practical tool for attributing the direction of bias amplification (image-to-caption or caption-to-image) in vision-language models, enabling more targeted fairness interventions than existing non-directional alternatives. The evaluation on the widely used COCO dataset adds practical relevance for captioning applications.

major comments (2)
  1. [Experiments] Experiments section: The central claim that DBAC is 'the only reliable metric' rests on the assertion that LIC's non-directional character prevents source identification. No ablation is reported that removes directionality from DBAC (or augments LIC with post-hoc conditional-probability analysis) while preserving language awareness, leaving open the possibility that reported superiority arises from other design choices such as normalization or encoder robustness rather than directionality itself.
  2. [Method] Method / DBAC definition: The manuscript states that a directional formulation is necessary to identify the source of amplification, yet provides no formal argument or counter-example demonstrating that a symmetric language-aware metric cannot recover direction via post-hoc analysis of conditional probabilities. This gap directly affects the load-bearing claim that non-directional metrics are inherently insufficient.
minor comments (2)
  1. [Abstract] Abstract: The sentence describing LIC's 'crucial limitation' is stated without a short illustrative example; adding one would clarify why directionality is required for source attribution.
  2. Notation: Ensure consistent use of symbols for the directional components (e.g., image-to-caption vs. caption-to-image flows) across equations and text to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the claims regarding DBAC's directionality.

  1. Referee: [Experiments] Experiments section: The central claim that DBAC is 'the only reliable metric' rests on the assertion that LIC's non-directional character prevents source identification. No ablation is reported that removes directionality from DBAC (or augments LIC with post-hoc conditional-probability analysis) while preserving language awareness, leaving open the possibility that reported superiority arises from other design choices such as normalization or encoder robustness rather than directionality itself.

    Authors: We agree that an ablation isolating the effect of directionality is needed to rule out contributions from other design choices. In the revised manuscript we will add an ablation that symmetrizes DBAC (averaging the two directional terms) while retaining language awareness and normalization, then directly compare its source-identification accuracy to LIC on the COCO gender and race tasks. revision: yes

  2. Referee: [Method] Method / DBAC definition: The manuscript states that a directional formulation is necessary to identify the source of amplification, yet provides no formal argument or counter-example demonstrating that a symmetric language-aware metric cannot recover direction via post-hoc analysis of conditional probabilities. This gap directly affects the load-bearing claim that non-directional metrics are inherently insufficient.

    Authors: We acknowledge the missing formal justification. The revision will include a short subsection with a counter-example: a joint distribution over images and captions where post-hoc conditioning on a symmetrized language-aware score yields ambiguous or reversed direction attribution, showing why directionality must be intrinsic rather than recovered after the fact. revision: yes
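The symmetrized ablation and the counter-example described in these responses rest on one observation, which a toy sketch can show. Assuming two hypothetical directional scores (labelled image-side and text-side, with made-up values), averaging them gives the same number whether the excess bias sits on one side or the other, and opposite-signed effects can cancel entirely. This illustrates the argument only; it is not the authors' ablation.

```python
def symmetrize(image_side: float, text_side: float) -> float:
    """Collapse two directional scores into one by averaging them."""
    return 0.5 * (image_side + text_side)


# Made-up (image_side, text_side) score pairs for three scenarios.
scenarios = {
    "amplified on image side only": (0.30, 0.00),
    "amplified on text side only": (0.00, 0.30),
    "opposite signs cancel": (0.20, -0.20),
}

symmetric = {name: symmetrize(*pair) for name, pair in scenarios.items()}

for name, value in symmetric.items():
    print(f"{name}: {value:+.2f}")
```

The first two scenarios collapse to the same averaged value, so the averaged score alone cannot tell them apart. The sketch does not settle the referee's sharper question of whether post-hoc conditional-probability analysis could still recover the direction.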

Circularity Check

0 steps flagged

No circularity: new metric proposal rests on external experiments, not self-definition or self-citation


The paper introduces DBAC as a directional, language-aware metric and asserts its superiority over LIC via experiments on COCO gender/race attributes. No equations, fitted parameters, or derivations appear in the abstract or described chain that reduce by construction to the inputs (e.g., no self-definitional ratio or prediction that is the fit itself). The central claim that DBAC is the only reliable metric is supported by empirical comparison rather than a load-bearing self-citation or uniqueness theorem imported from the authors' prior work. The provided text contains no self-citations at all, let alone ones that justify the necessity of directionality. This is a standard case of a self-contained empirical proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are identifiable from the provided text.


