Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis
Pith reviewed 2026-05-25 17:09 UTC · model grok-4.3
The pith
Sentiment analysis models assign different scores to the same occupation depending on whether the subject is described as male or female.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that occupational gender stereotypes appear in sentiment analysis models through measurable differences in sentiment scores on gender-swapped sentences about the same professions, and that the pattern of these differences corresponds to societal perceptions of occupational gender roles.
What carries the argument
A released gender-balanced dataset of 800 sentences about professions, employed as a test bench to isolate sentiment differences attributable to gender-occupation pairings.
If this is right
- Sentiment models will carry forward and potentially reinforce societal occupational gender stereotypes in any downstream task that processes text about jobs.
- The degree of bias in a model can be quantified and compared against independent measures of societal occupational stereotypes.
- Applications that rely on sentiment scores for hiring-related text, reviews, or social media will inherit these occupation-by-gender patterns unless corrected.
- The released dataset provides a public benchmark that other researchers can apply to additional models or updated versions of existing models.
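The second implication above, quantifying bias and comparing it with an independent stereotype measure, can be illustrated with a minimal sketch. It assumes each occupation has paired sentiment scores for male-subject and female-subject sentences and one external stereotype rating. The function names, scores, and ratings are hypothetical placeholders, not values from the paper, and Spearman rank correlation is an assumed choice of statistic rather than the authors' confirmed method.

```python
from statistics import mean

def occupation_gaps(scores):
    """Map each occupation to its mean (female - male) sentiment gap.

    scores: dict of occupation -> list of (male_score, female_score) pairs.
    """
    return {occ: mean(f - m for m, f in pairs) for occ, pairs in scores.items()}

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical model scores: (male_score, female_score) per sentence pair.
scores = {
    "secretary": [(0.60, 0.70), (0.55, 0.65)],
    "truck driver": [(0.50, 0.40), (0.45, 0.35)],
    "teacher": [(0.60, 0.62), (0.58, 0.60)],
}
# Hypothetical external rating, e.g. how female-typed each occupation is perceived.
stereotype = {"secretary": 0.9, "truck driver": 0.1, "teacher": 0.7}

gaps = occupation_gaps(scores)
occupations = sorted(gaps)
rho = spearman([gaps[o] for o in occupations], [stereotype[o] for o in occupations])
print(gaps, rho)
```

A rank correlation is used here because it only assumes the two measures order occupations similarly, not that they share a scale.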
Where Pith is reading between the lines
- The same sentence-construction approach could be adapted to measure other embedded stereotypes such as those involving race or age in sentiment outputs.
- Retraining the tested models on data that deliberately balances gender across professions might reduce the measured score differences.
- The method offers a template for auditing bias in other NLP tasks that assign numerical scores to text involving people and roles.
Load-bearing premise
Differences in model scores on the 800 sentences arise solely from the gender-occupation link rather than from other variations in sentence wording or structure.
What would settle it
Applying the three models to the dataset and finding no consistent, statistically significant difference in average sentiment between the male-subject and female-subject versions of the same profession sentences.
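One way to run such a check is a paired sign-flip permutation test on the per-pair score differences. The sketch below is an assumed procedure, not necessarily the test used in the paper. The function name and the example scores are hypothetical.

```python
import random

def paired_sign_flip_test(male_scores, female_scores, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on the mean of (female - male).

    Under the null hypothesis that gender has no effect, each paired
    difference is equally likely to have either sign, so we randomly flip
    signs and count how often the resampled mean is at least as extreme
    as the observed one.
    """
    diffs = [f - m for m, f in zip(male_scores, female_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical scores for ten gender-swapped sentence pairs of one profession.
male = [0.52, 0.48, 0.61, 0.55, 0.50, 0.47, 0.58, 0.53, 0.49, 0.60]
female = [0.44, 0.41, 0.55, 0.47, 0.45, 0.40, 0.50, 0.46, 0.43, 0.52]
mean_gap, p = paired_sign_flip_test(male, female)
print(mean_gap, p)
```

If many professions or models are tested this way, the p-values would need a multiple-comparison correction before any are called significant.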
Original abstract
In this work, we investigate the presence of occupational gender stereotypes in sentiment analysis models. Such a task has implications for reducing implicit biases in these models, which are being applied to an increasingly wide variety of downstream tasks. We release a new gender-balanced dataset of 800 sentences pertaining to specific professions and propose a methodology for using it as a test bench to evaluate sentiment analysis models. We evaluate the presence of occupational gender stereotypes in 3 different models using our approach, and explore their relationship with societal perceptions of occupations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that occupational gender stereotypes are present in sentiment analysis models and can be measured using a newly released gender-balanced dataset of 800 profession-related sentences. It proposes a methodology to use this dataset as a test bench, evaluates three models, and relates the observed biases to societal perceptions of occupations.
Significance. If the dataset construction isolates gender-occupation effects without lexical confounds, the work supplies a concrete, reproducible benchmark for quantifying and mitigating implicit biases in sentiment models, which are widely deployed in downstream applications.
major comments (1)
- [§4] §4 (Dataset): the claim that the 800-sentence set isolates occupational gender stereotypes requires evidence that male/female sentence pairs differ only in gender markers; no validation is provided that verb choice, objects, sentence length, or profession-specific phrasing are balanced, so sentiment gaps could arise from template artifacts rather than stereotypes.
minor comments (1)
- The abstract does not name the three evaluated models or the exact societal-perception data source used for correlation analysis.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment on dataset validation below.
- Referee: [§4] §4 (Dataset): the claim that the 800-sentence set isolates occupational gender stereotypes requires evidence that male/female sentence pairs differ only in gender markers; no validation is provided that verb choice, objects, sentence length, or profession-specific phrasing are balanced, so sentiment gaps could arise from template artifacts rather than stereotypes.
  Authors: We agree that the manuscript does not include explicit quantitative validation of balance across non-gender features. The 800 sentences were generated from a small number of fixed templates per profession, with only the gendered pronoun and the profession noun varied while holding verbs, objects, and overall structure constant within each profession pair. We will revise §4 to describe the template design in detail and add balance statistics (identical sentence lengths within pairs, identical verbs/objects for matched sentences) to demonstrate that sentiment differences arise from gender-occupation associations rather than template artifacts.
  Revision: yes
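The balance check promised in the rebuttal could look roughly like the sketch below, which flags any sentence pair that differs in length or in a token other than a gender marker. The swap list, tokenizer, and example sentences are assumptions made for illustration. None of them come from the released dataset.

```python
# Hypothetical (male_token, female_token) substitutions treated as gender markers.
ALLOWED_SWAPS = {
    ("he", "she"),
    ("him", "her"),
    ("his", "her"),
    ("himself", "herself"),
    ("man", "woman"),
    ("male", "female"),
}

def tokenize(sentence):
    """Lowercase and split on whitespace after dropping periods and commas."""
    return sentence.lower().replace(".", " ").replace(",", " ").split()

def pair_is_balanced(male_sentence, female_sentence):
    """True if the pair has equal token counts and differs only in gender markers."""
    m, f = tokenize(male_sentence), tokenize(female_sentence)
    if len(m) != len(f):
        return False
    return all(a == b or (a, b) in ALLOWED_SWAPS for a, b in zip(m, f))

def unbalanced_pairs(pairs):
    """Return the (male, female) sentence pairs that fail the balance check."""
    return [pair for pair in pairs if not pair_is_balanced(*pair)]

# Invented example pairs; the second differs in an adjective, so it is flagged.
pairs = [
    ("He is a good secretary.", "She is a good secretary."),
    ("He is a bad truck driver.", "She is a reckless truck driver."),
]
print(unbalanced_pairs(pairs))
```

An empty result over all pairs would support the premise that score gaps are not template artifacts, though it would not rule out tokenization effects inside a model.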
Circularity Check
Empirical measurement study with newly introduced dataset exhibits no circularity
The paper releases a new gender-balanced dataset of 800 sentences and applies it to measure occupational gender stereotypes in three sentiment models, relating results to external societal perceptions. No derivation chain, equations, or fitted parameters are present; the central claim rests on the independent construction and evaluation of this dataset rather than reducing to prior inputs, self-citations, or ansatzes. This matches the default expectation for non-circular empirical work.