A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

Bhanu Tokas; Hannah Kerner; Rahul Nair

arXiv: 2503.07878 · v5 · submitted 2025-03-10 · cs.CV · cs.AI

Pith reviewed 2026-05-22 23:51 UTC · model grok-4.3

keywords: bias amplification · image captioning · directional metrics · gender bias · race bias · COCO captions · LIC metric

The pith

A directional metric called DBAC identifies whether image captioning models amplify biases from the image or the text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Directional Bias Amplification in Captioning (DBAC) as a language-aware metric that tracks bias amplification while also showing its direction. Prior metrics either ignore caption semantics or, like LIC, cannot distinguish whether amplification originates in the visual input or the generated text. DBAC adds directionality, reduces sensitivity to sentence encoder choice, and yields more accurate estimates on attributes such as gender and race. Experiments on the COCO captions dataset indicate that only the directional formulation reliably surfaces the source of amplification in captioning models.

Core claim

DBAC measures bias amplification in captions in a directional manner, allowing the source of amplification (image versus text) to be identified, which non-directional language-aware metrics cannot do.

What carries the argument

Directional Bias Amplification in Captioning (DBAC), a metric that computes amplification scores while separating the contribution of the image from the contribution of the caption text.
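To make "directional" concrete, here is a minimal sketch of a directional amplification score in the spirit of the BiasAmp formulation of Wang and Russakovsky (ICML 2021), which works on classification labels. It is not the DBAC definition, which this page does not reproduce, and it is not language-aware. The function names, the "attribute" and "task" labels, and the toy counts are all assumptions made for illustration.

```python
from itertools import product


def cond_prob(pairs, source, target):
    """Estimate P(target | source) from a list of (source, target) pairs."""
    with_source = [p for p in pairs if p[0] == source]
    if not with_source:
        return 0.0
    return sum(p[1] == target for p in with_source) / len(with_source)


def directional_amplification(reference, predicted):
    """Amplification in the source -> target direction (hypothetical helper).

    reference: (source, target) pairs observed in the data.
    predicted: (source, predicted target) pairs produced by a model.
    """
    n = len(reference)
    sources = sorted({s for s, _ in reference})
    targets = sorted({t for _, t in reference})
    total = 0.0
    for s, t in product(sources, targets):
        p_s = sum(p[0] == s for p in reference) / n
        p_t = sum(p[1] == t for p in reference) / n
        p_st = sum(p == (s, t) for p in reference) / n
        positively_correlated = p_st > p_s * p_t
        delta = cond_prob(predicted, s, t) - cond_prob(reference, s, t)
        # A shift counts as amplification when it moves further in the
        # direction of the correlation already present in the data.
        total += delta if positively_correlated else -delta
    return total / (len(sources) * len(targets))


# Toy data (made up): (attribute, task) pairs with a mild correlation.
data = ([("woman", "cooking")] * 6 + [("woman", "skiing")] * 4
        + [("man", "cooking")] * 4 + [("man", "skiing")] * 6)

# Attribute -> task: the model's task predictions exaggerate the correlation.
pred_task = ([("woman", "cooking")] * 8 + [("woman", "skiing")] * 2
             + [("man", "cooking")] * 2 + [("man", "skiing")] * 8)

# Task -> attribute: the model's attribute predictions mirror the data.
data_ta = [(t, a) for a, t in data]
pred_attr = list(data_ta)

a_to_t = directional_amplification(data, pred_task)
t_to_a = directional_amplification(data_ta, pred_attr)
print(round(a_to_t, 3), round(t_to_a, 3))
```

In this toy setup the score is positive in the attribute-to-task direction and zero in the reverse direction, which is the kind of source attribution a single non-directional number cannot express.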

If this is right

Captioning models can be diagnosed to determine whether they amplify image biases or text biases on attributes such as gender and race.
Targeted interventions can be applied to the identified source once the direction of amplification is known.
Bias estimates become more stable across different sentence encoders used to encode captions.
Evaluation of captioning systems on datasets like COCO becomes possible with a single reliable metric instead of multiple incomplete ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same directional approach could be tested on other multimodal generation tasks where source attribution matters.
If directionality proves necessary here, similar reformulations might improve bias measurement in related language-vision settings.
Developers could use DBAC scores to decide whether to clean image data, caption data, or both.

Load-bearing premise

Identifying the source of bias amplification requires a directional formulation rather than a non-directional language-aware metric.

What would settle it

Apply DBAC to a captioning model in which bias amplification has been artificially introduced only on the image side and check whether the metric attributes the excess bias to the image rather than the text.

Figures

Figures reproduced from arXiv: 2503.07878 by Bhanu Tokas, Hannah Kerner, Rahul Nair.

**Figure 1.** GloVe Thresholds: Percentage of GloVe substitutions vs. the selected threshold δ across different models. To select the optimal distance threshold δ for our experiments, we observed how many contextual substitutions occurred at different thresholds. We tested this using the GloVe word embedding model. If δ is large, we allow weaker word substitutions. Consider the HGC: “a <gender> is sleeping”. Assume th…

Original abstract

When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Directional Bias Amplification in Captioning (DBAC), a language-aware directional metric for quantifying bias amplification in image captioning. It positions DBAC as addressing limitations of classification-only metrics (BA, DPA) and of LIC (language-aware but non-directional, hence unable to identify bias source). DBAC is claimed to be less sensitive to sentence-encoder choice and to yield more accurate estimates; experiments on gender and race attributes in COCO captions are used to conclude that DBAC is the only reliable metric for this task.

Significance. If the empirical claims hold, DBAC would supply a practical tool for attributing the direction of bias amplification (image-to-caption or caption-to-image) in vision-language models, enabling more targeted fairness interventions than existing non-directional alternatives. The evaluation on the widely used COCO dataset adds practical relevance for captioning applications.

major comments (2)

[Experiments] Experiments section: The central claim that DBAC is 'the only reliable metric' rests on the assertion that LIC's non-directional character prevents source identification. No ablation is reported that removes directionality from DBAC (or augments LIC with post-hoc conditional-probability analysis) while preserving language awareness, leaving open the possibility that reported superiority arises from other design choices such as normalization or encoder robustness rather than directionality itself.
[Method] Method / DBAC definition: The manuscript states that a directional formulation is necessary to identify the source of amplification, yet provides no formal argument or counter-example demonstrating that a symmetric language-aware metric cannot recover direction via post-hoc analysis of conditional probabilities. This gap directly affects the load-bearing claim that non-directional metrics are inherently insufficient.

minor comments (2)

[Abstract] Abstract: The sentence describing LIC's 'crucial limitation' is stated without a short illustrative example; adding one would clarify why directionality is required for source attribution.
Notation: Ensure consistent use of symbols for the directional components (e.g., image-to-caption vs. caption-to-image flows) across equations and text to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the claims regarding DBAC's directionality.

Referee: [Experiments] Experiments section: The central claim that DBAC is 'the only reliable metric' rests on the assertion that LIC's non-directional character prevents source identification. No ablation is reported that removes directionality from DBAC (or augments LIC with post-hoc conditional-probability analysis) while preserving language awareness, leaving open the possibility that reported superiority arises from other design choices such as normalization or encoder robustness rather than directionality itself.

Authors: We agree that an ablation isolating the effect of directionality is needed to rule out contributions from other design choices. In the revised manuscript we will add an ablation that symmetrizes DBAC (averaging the two directional terms) while retaining language awareness and normalization, then directly compare its source-identification accuracy to LIC on the COCO gender and race tasks. revision: yes
Referee: [Method] Method / DBAC definition: The manuscript states that a directional formulation is necessary to identify the source of amplification, yet provides no formal argument or counter-example demonstrating that a symmetric language-aware metric cannot recover direction via post-hoc analysis of conditional probabilities. This gap directly affects the load-bearing claim that non-directional metrics are inherently insufficient.

Authors: We acknowledge the missing formal justification. The revision will include a short subsection with a counter-example: a joint distribution over images and captions where post-hoc conditioning on a symmetrized language-aware score yields ambiguous or reversed direction attribution, showing why directionality must be intrinsic rather than recovered after the fact. revision: yes
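The intuition behind the symmetrized ablation and the promised counter-example can be illustrated with simple arithmetic. This is only a sketch: the symbols $d_1$ and $d_2$ (the two directional terms) and the numeric values are hypothetical, and the authors' actual counter-example, which concerns post-hoc conditioning on a joint distribution, is not available on this page.

```latex
% Symmetrized score: the average of the two directional terms.
\[
\bar{d} = \tfrac{1}{2}\,(d_1 + d_2)
\]
% Two hypothetical models with opposite sources of amplification:
\[
\text{Model 1: } (d_1, d_2) = (0.2,\ 0) \;\Rightarrow\; \bar{d} = 0.1,
\qquad
\text{Model 2: } (d_1, d_2) = (0,\ 0.2) \;\Rightarrow\; \bar{d} = 0.1 .
\]
```

Both models receive the same symmetrized score even though the amplification comes from opposite directions, so the averaged number alone cannot tell them apart. Whether additional post-hoc analysis could recover the direction is the open question the referee raises.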

Circularity Check

0 steps flagged

No circularity: new metric proposal rests on external experiments, not self-definition or self-citation

The paper introduces DBAC as a directional, language-aware metric and asserts its superiority over LIC via experiments on COCO gender/race attributes. No equations, fitted parameters, or derivations appear in the abstract or described chain that reduce by construction to the inputs (e.g., no self-definitional ratio or prediction that is the fit itself). The central claim that DBAC is the only reliable metric is supported by empirical comparison rather than a load-bearing self-citation or uniqueness theorem imported from the authors' prior work. The provided text contains no self-citations at all, let alone ones that justify the necessity of directionality. This is a standard case of a self-contained empirical proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5734 in / 1021 out tokens · 32749 ms · 2026-05-22T23:51:48.922276+00:00 · methodology


[1] [1]

Imagecaptioner2: Image captioner for im- age captioning bias amplification assessment.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19): 20902–20911, 2024

Eslam Abdelrahman, Pengzhan Sun, Li Erran Li, and Mo- hamed Elhoseiny. Imagecaptioner2: Image captioner for im- age captioning bias amplification assessment.Proceedings of the AAAI Conference on Artificial Intelligence, 38(19): 20902–20911, 2024. 1, 2

work page 2024

[2] [2]

Artemis: Affective language for visual art

Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, and Leonidas J Guibas. Artemis: Affective language for visual art. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11569–11579, 2021. 4

work page 2021

[3] [3]

Anderson, X

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down atten- tion for image captioning and visual question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 6077–6086, Los Alamitos, CA, USA, 2018. IEEE Computer Society. 4, 6

work page 2018

[4] [4]

Enriching Word Vectors with Subword Information

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword in- formation.arXiv preprint arXiv:1607.04606, 2016. 4, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Man is to computer program- mer as woman is to homemaker? debiasing word embed- dings.Advances in neural information processing systems, 29, 2016

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer program- mer as woman is to homemaker? debiasing word embed- dings.Advances in neural information processing systems, 29, 2016. 8

work page 2016

[6] [6]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2015

[7] [7]

Meteor universal: Lan- guage specific translation evaluation for any target language

Michael Denkowski and Alon Lavie. Meteor universal: Lan- guage specific translation evaluation for any target language. InProceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA,

work page

[8] [8]

Association for Computational Linguistics. 7

work page

[9] [9]

Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan

James R. Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan. An intersectional definition of fairness. In2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1918–1921, 2020. 2

work page 1918

[10] [10]

Women also snowboard: Over- coming bias in captioning models

Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Over- coming bias in captioning models. InProceedings of the Eu- ropean Conference on Computer Vision (ECCV), 2018. 2, 4, 6

work page 2018

[11] [11]

Hirota, Y

Y . Hirota, Y . Nakashima, and N. Garcia. Quantifying societal bias amplification in image captioning. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13440–13449, Los Alamitos, CA, USA,

work page

[12] [12]

IEEE Computer Society. 1, 2, 4

work page

[13] [13]

Long short-term memory.Neural computation, 9(8):1735–1780, 1997

Sepp Hochreiter and J ¨urgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997. 6, 7

work page 1997

[14] [14]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML,

work page

[15] [15]

Oscar: Object-semantics aligned pre-training for vision-language tasks

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer,

work page 2020

[16] [16]

Crank up the volume: preference bias amplification in collaborative recommendation, 2019

Kun Lin, Nasim Sonboli, Bamshad Mobasher, and Robin Burke. Crank up the volume: preference bias amplification in collaborative recommendation, 2019. 2

work page 2019

[17] [17]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 4, 6

work page 2023

[18] [18]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 4, 6

work page 2023

[19] [19]

Bow- man, and Rachel Rudinger

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bow- man, and Rachel Rudinger. On measuring social biases in sentence encoders. InProceedings of the 2019 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, Vol- ume 1 (Long and Short Papers), pages 622–628, Minneapo- lis, Minnesota, 20...

work page 2019

[20] [20]

Gender ar- tifacts in visual datasets

Nicole Meister, Dora Zhao, Angelina Wang, Vikram V Ra- maswamy, Ruth Fong, and Olga Russakovsky. Gender ar- tifacts in visual datasets. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4837– 4848, 2023. 2

work page 2023

[21] [21]

Swapneel Mishra, Saumya Seth, Shrishti Jain, Vasudev Pant, Jolly Parikh, Rachna Jain, and Sardar M.N. Islam. Image caption generation using vision transformer and gpt archi- tecture. In2024 2nd International Conference on Advance- ment in Computation & Computer Technologies (InCACCT), pages 1–6, 2024. 4, 6

work page 2024

[22] [22]

It is okay to not be okay: Overcoming emotional bias in affective image cap- tioning by contrastive data collection

Youssef Mohamed, Faizan Farooq Khan, Kilichbek Hay- darov, and Mohamed Elhoseiny. It is okay to not be okay: Overcoming emotional bias in affective image cap- tioning by contrastive data collection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21263–21272, 2022. 4

work page 2022

[23] [23]

GloVe: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher Man- ning. GloVe: Global vectors for word representation. InPro- ceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, 2014. Association for Computational Linguis- tics. 4, 7, 8

work page 2014

[24] [24]

Gender biases in automatic evaluation met- rics for image captioning.arXiv preprint arXiv:2305.14711,

Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, and Nanyun Peng. Gender biases in automatic evaluation met- rics for image captioning.arXiv preprint arXiv:2305.14711,

work page arXiv

[25]

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024,

[26]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019. 7

[27]

Robin M Schmidt. Recurrent neural networks (rnns): A gentle introduction and overview. arXiv preprint arXiv:1912.05911, 2019. 7

[28]

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626,

[29]

Preethi Seshadri, Sameer Singh, and Yanai Elazar. The bias amplification paradox in text-to-image generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6367–6384, Mexico City, Mexico, 2024. Association for Computational Li...

[30]

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. arXiv preprint arXiv:2004.09297, 2020. 7

[31]

Tejal Tiwary and Rajendra Prasad Mahapatra. An accurate generation of image captions for blind people using extended convolutional atom neural network. Multimedia Tools and Applications, 82(3):3801–3830, 2023. 1

[32]

Bhanu Tokas, Rahul Nair, and Hannah Kerner. Making bias amplification in balanced datasets directional and interpretable, 2024. 1, 2, 3

[33]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.,

[34]

Mauro Vera. Exploring and mitigating gender bias in glove word embeddings. 2018. 8

[35]

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015. 4, 6

[36]

Angelina Wang and Olga Russakovsky. Directional bias amplification. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, pages 10882–10893. ML Research Press, 2021.

[37]

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020. Curran Associates Inc. 7

[38]

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. arXiv preprint arXiv:2311.06242, 2023. 4, 6

[39]

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057. PMLR, 2015. 4, 6

[40]

Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514,

[41]

Dora Zhao, Jerone Andrews, and Alice Xiang. Men also do laundry: Multi-attribute bias amplification. In Proceedings of the 40th International Conference on Machine Learning, pages 42000–42017. PMLR, 2023. 2

[42]

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark, 2017. Association for Computational Linguistics. 1, 2

A...

We reported the corresponding LIC scores in Table 11. We reported DBAC_{A→T} scores on the four pre-trained encoders in Table 12a. We reported the corresponding LIC scores in Table 12b. For race, we reported DBAC_{A→T} scores on six sentence encoders trained from scratch, in Table 13. We reported the corresponding LIC scores in Table 14. We reported DBAC_{A→T} s...