
arxiv: 1906.10256 · v2 · pith:TECGLZJWnew · submitted 2019-06-24 · 💻 cs.CL

Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis

Pith reviewed 2026-05-25 17:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords occupational gender stereotypes · sentiment analysis · gender bias · NLP evaluation · profession datasets · model bias testing

The pith

Sentiment analysis models assign different scores to the same occupation depending on whether the subject is described as male or female.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset of 800 gender-balanced sentences about specific professions and uses it to test whether sentiment models produce systematically different outputs based on the gender paired with each job. It runs the test on three models and checks whether the observed differences match broader societal views of which occupations are seen as masculine or feminine. A reader would care because these models are increasingly used in applications that evaluate text about people and work. If the differences are real and consistent, the work supplies a repeatable method for detecting and tracking such biases.

Core claim

The authors establish that occupational gender stereotypes appear in sentiment analysis models through measurable differences in sentiment scores on gender-swapped sentences about the same professions, and that the pattern of these differences corresponds to societal perceptions of occupational gender roles.

What carries the argument

A released gender-balanced dataset of 800 sentences about professions, employed as a test bench to isolate sentiment differences attributable to gender-occupation pairings.
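One way to picture the test bench is as a per-profession comparison of mean positive-class scores between female-subject and male-subject sentences. The sketch below is only an illustration under stated assumptions: the function name `gender_gap_by_profession`, the stand-in scorer `toy_score`, and the four example rows are all hypothetical and are not taken from the released dataset or from the models the paper evaluates.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple

# (profession, gender, sentence). These rows are invented for illustration.
Row = Tuple[str, str, str]


def gender_gap_by_profession(
    rows: Iterable[Row], score: Callable[[str], float]
) -> Dict[str, float]:
    """Mean score of female-subject sentences minus male-subject ones, per profession."""
    buckets = defaultdict(lambda: {"female": [], "male": []})
    for profession, gender, sentence in rows:
        buckets[profession][gender].append(score(sentence))
    return {p: mean(b["female"]) - mean(b["male"]) for p, b in buckets.items()}


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model's positive-class probability."""
    return 0.75 if "woman" in sentence else 0.5


rows = [
    ("pilot", "female", "This woman is a pilot."),
    ("pilot", "male", "This man is a pilot."),
    ("nurse", "female", "This woman is a nurse."),
    ("nurse", "male", "This man is a nurse."),
]
gaps = gender_gap_by_profession(rows, toy_score)
```

In practice `toy_score` would be replaced by a call into whichever sentiment model is under test; a positive gap means the female-subject sentences scored higher for that profession.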

If this is right

  • Sentiment models will carry forward and potentially reinforce societal occupational gender stereotypes in any downstream task that processes text about jobs.
  • The degree of bias in a model can be quantified and compared against independent measures of societal occupational stereotypes.
  • Applications that rely on sentiment scores for hiring-related text, reviews, or social media will inherit these occupation-by-gender patterns unless corrected.
  • The released dataset provides a public benchmark that other researchers can apply to additional models or updated versions of existing models.
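The second point above, comparing model bias against an independent measure, can be sketched as a correlation between per-profession score gaps and an external per-profession statistic. The example uses a plain Pearson correlation as an assumption; the source does not state which statistic the authors used, and the helper name `pearson` is hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

To use it, align the two sequences by profession, for example model gaps in one list and the external measure (such as an occupational survey statistic) in the other, in the same profession order.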

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sentence-construction approach could be adapted to measure other embedded stereotypes such as those involving race or age in sentiment outputs.
  • Retraining the tested models on data that deliberately balances gender across professions might reduce the measured score differences.
  • The method offers a template for auditing bias in other NLP tasks that assign numerical scores to text involving people and roles.

Load-bearing premise

Differences in model scores on the 800 sentences arise solely from the gender-occupation link rather than from other variations in sentence wording or structure.

What would settle it

Applying the three models to the dataset and finding no consistent, statistically significant difference in average sentiment between the male-subject and female-subject versions of the same profession sentences.
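A check of this kind could be run as a paired test on the per-pair score differences (female-subject score minus male-subject score for matched sentences). The sketch below uses an exact sign-flip permutation test, which is a choice made here for illustration and not necessarily the test the authors applied; `sign_flip_p_value` is a hypothetical name.

```python
from itertools import product
from statistics import mean
from typing import Sequence


def sign_flip_p_value(diffs: Sequence[float]) -> float:
    """Exact two-sided sign-flip test on the mean of paired differences.

    Under the null hypothesis that gender has no effect, each paired
    difference is equally likely to be positive or negative, so every
    sign assignment is enumerated and compared with the observed mean.
    """
    if not diffs:
        raise ValueError("need at least one paired difference")
    if len(diffs) > 20:
        raise ValueError("too many pairs for exact enumeration; sample sign flips instead")
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance so float rounding does not exclude the observed case.
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total
```

A large p-value for every profession and every model would be the null outcome described above. For the full 800-sentence set, exact enumeration is infeasible, so random sign flips would have to be sampled instead.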

Figures

Figures reproduced from arXiv: 1906.10256 by Isha Bhallamudi, Jayadev Bhaskaran.

Figure 1: Simple diagram of our task definition.
Figure 2: Median weekly earnings (Current Population Survey, 2018) vs. mean predicted positive probability using M.3 (BERT), per profession.
original abstract

In this work, we investigate the presence of occupational gender stereotypes in sentiment analysis models. Such a task has implications for reducing implicit biases in these models, which are being applied to an increasingly wide variety of downstream tasks. We release a new gender-balanced dataset of 800 sentences pertaining to specific professions and propose a methodology for using it as a test bench to evaluate sentiment analysis models. We evaluate the presence of occupational gender stereotypes in 3 different models using our approach, and explore their relationship with societal perceptions of occupations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that occupational gender stereotypes are present in sentiment analysis models and can be measured using a newly released gender-balanced dataset of 800 profession-related sentences. It proposes a methodology to use this dataset as a test bench, evaluates three models, and relates the observed biases to societal perceptions of occupations.

Significance. If the dataset construction isolates gender-occupation effects without lexical confounds, the work supplies a concrete, reproducible benchmark for quantifying and mitigating implicit biases in sentiment models, which are widely deployed in downstream applications.

major comments (1)
  1. §4 (Dataset): the claim that the 800-sentence set isolates occupational gender stereotypes requires evidence that male/female sentence pairs differ only in gender markers; no validation is provided that verb choice, objects, sentence length, or profession-specific phrasing are balanced, so sentiment gaps could arise from template artifacts rather than stereotypes.
minor comments (1)
  1. The abstract does not name the three evaluated models or the exact societal-perception data source used for correlation analysis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on dataset validation below.

point-by-point responses
  1. Referee: §4 (Dataset): the claim that the 800-sentence set isolates occupational gender stereotypes requires evidence that male/female sentence pairs differ only in gender markers; no validation is provided that verb choice, objects, sentence length, or profession-specific phrasing are balanced, so sentiment gaps could arise from template artifacts rather than stereotypes.

    Authors: We agree that the manuscript does not include explicit quantitative validation of balance across non-gender features. The 800 sentences were generated from a small number of fixed templates per profession, with only the gendered pronoun and the profession noun varied while holding verbs, objects, and overall structure constant within each profession pair. We will revise §4 to describe the template design in detail and add balance statistics (identical sentence lengths within pairs, identical verbs/objects for matched sentences) to demonstrate that sentiment differences arise from gender-occupation associations rather than template artifacts. revision: yes
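The balance statistics promised in this response could be produced by a check along the following lines. It is a minimal sketch, assuming each matched pair should have the same token count and differ only in tokens drawn from a known gender-swap table; the table `GENDER_SWAPS` and the function `differs_only_in_gender` are illustrative assumptions and not the paper's actual word list or validation code.

```python
import re
from typing import List

# Illustrative male -> female swaps; NOT the list used to build the dataset.
GENDER_SWAPS = {
    "he": "she",
    "him": "her",
    "his": "her",
    "man": "woman",
    "boy": "girl",
    "father": "mother",
}


def tokenize(sentence: str) -> List[str]:
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z']+", sentence.lower())


def differs_only_in_gender(male_sentence: str, female_sentence: str) -> bool:
    """True if the pair has equal length and differs only by known gender swaps."""
    male_tokens = tokenize(male_sentence)
    female_tokens = tokenize(female_sentence)
    if len(male_tokens) != len(female_tokens):
        return False
    return all(
        m == f or GENDER_SWAPS.get(m) == f
        for m, f in zip(male_tokens, female_tokens)
    )
```

Running this over every matched pair and reporting the fraction that passes would give one concrete balance statistic; any failing pair points to a verb, object, or length difference that could confound the sentiment gap.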

Circularity Check

0 steps flagged

Empirical measurement study with newly introduced dataset exhibits no circularity

full rationale

The paper releases a new gender-balanced dataset of 800 sentences and applies it to measure occupational gender stereotypes in three sentiment models, relating results to external societal perceptions. No derivation chain, equations, or fitted parameters are present; the central claim rests on the independent construction and evaluation of this dataset rather than reducing to prior inputs, self-citations, or ansatzes. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or required by the described contribution.

pith-pipeline@v0.9.0 · 5608 in / 979 out tokens · 29660 ms · 2026-05-25T17:09:42.026405+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 8 internal anchors

  3. [3]

    Evaluating the Underlying Gender Bias in Contextualized Word Embeddings

    Christine Basta, Marta R. Costa-jussà, and Noe Casas. 2019. http://arxiv.org/abs/1904.08783 Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. arXiv e-prints

  4. [4]

    Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. http://dl.acm.org/citation.cfm?id=3157382.3157584 Man is to Computer Programmer As Woman is to Homemaker? Debiasing Word Embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4356--4364, USA. Cur...

  5. [5]

    Carlo Bonferroni. 1936. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3--62

  6. [6]

    Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. https://doi.org/10.1126/science.aal4230 Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183--186

  7. [7]

    Mary Ann Cejka and Alice H. Eagly. 1999. https://doi.org/10.1177/0146167299025004002 Gender-Stereotypic Images of Occupations Correspond to the Sex Segregation of Employment. Personality and Social Psychology Bulletin, 25(4):413--423

  8. [8]

    François Chollet et al. 2015. Keras. https://keras.io

  9. [9]

    Andre Costa and Adriano Veloso. 2015. https://doi.org/10.13140/RG.2.1.1623.3688 Employee analytics through sentiment analysis. In Brazilian Symposium on Databases

  10. [10]

    Current Population Survey. 2018. https://www.bls.gov/cps/cpsaat39.htm 39. Median weekly earnings of full-time wage and salary workers by detailed occupation and sex. Bureau of Labor Statistics, United States Department of Labor

  11. [11]

    Eva Derous and Ann Marie Ryan. 2018. https://doi.org/10.1111/1748-8583.12217 When your resume is (not) turning you down: Modelling ethnic bias in resume screening. Human Resource Management Journal, 29(2):113--130

  12. [12]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. http://arxiv.org/abs/1810.04805 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805

  13. [13]

    Alice H Eagly and Valerie J Steffen. 1984. Gender stereotypes stem from the distribution of women and men into social roles. Journal of personality and social psychology, 46(4):735

  14. [14]

    Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. https://doi.org/10.1073/pnas.1720347115 Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635--E3644

  15. [15]

    Peter Glick, Korin Wilk, and Michele Perreault. 1995. https://doi.org/10.1007/BF01544212 Images of occupations: Components of gender and status in occupational stereotypes. Sex Roles, 32(9):565--582

  16. [16]

    Hila Gonen and Yoav Goldberg. 2019. http://arxiv.org/abs/1903.03862 Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. arXiv e-prints

  17. [17]

    Elizabeth L. Haines, Kay Deaux, and Nicole Lofaro. 2016. https://doi.org/10.1177/0361684316634081 The Times They Are a-Changing... or Are They Not? A Comparison of Gender Stereotypes, 1983-2014. Psychology of Women Quarterly, 40(3):353--363

  18. [18]

    Sepp Hochreiter and Jürgen Schmidhuber. 1997. https://doi.org/10.1162/neco.1997.9.8.1735 Long Short-Term Memory. Neural Comput., 9(8):1735--1780

  19. [19]

    Jeremy Howard and Sebastian Ruder. 2018. http://arxiv.org/abs/1801.06146 Fine-tuned Language Models for Text Classification. CoRR, abs/1801.06146

  20. [20]

    Matthew Kay, Cynthia Matuszek, and Sean A. Munson. 2015. https://doi.org/10.1145/2702123.2702520 Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI '15, pages 3819--3828, New York, NY, USA. ACM

  21. [21]

    Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems

    Svetlana Kiritchenko and Saif M. Mohammad. 2018. http://arxiv.org/abs/1805.04508 Examining gender and race bias in two hundred sentiment analysis systems. CoRR, abs/1805.04508

  22. [22]

    On Measuring Social Biases in Sentence Encoders

    Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. http://arxiv.org/abs/1903.10561 On measuring social biases in sentence encoders

  23. [23]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Distributed Representations of Words and Phrases and their Compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, edit...

  24. [24]

    Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA

  25. [25]

    Astrid Nieuwets. 2015. Fallen Females: On the Semantic Pejoration of Mistress and Spinster. Bachelor's thesis, Utrecht University

  26. [26]

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825--2830

  27. [27]

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. http://www.aclweb.org/anthology/D14-1162 GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543

  28. [28]

    Deep contextualized word representations

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. http://arxiv.org/abs/1802.05365 Deep contextualized word representations. CoRR, abs/1802.05365

  29. [29]

    Alec Radford. 2018. Improving Language Understanding by Generative Pre-Training

  30. [30]

    Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. https://doi.org/10.18653/v1/n18-2002 Gender bias in coreference resolution. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

  31. [31]

    Laurie A. Rudman and Julie E. Phelan. 2008. https://doi.org/10.1016/j.riob.2008.04.003 Backlash effects for disconfirming gender stereotypes in organizations. Research in Organizational Behavior, 28:61--79

  32. [32]

    Eva H Shinar. 1975. https://doi.org/10.1016/0001-8791(75)90037-8 Sexual stereotypes of occupations. Journal of Vocational Behavior, 7(1):99--111

  33. [33]

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631--1642

  34. [34]

    Dries Vervecken, Bettina Hannover, and Ilka Wolter. 2013. https://doi.org/10.1016/j.jvb.2013.01.008 Changing (S)expectations: How gender fair job descriptions impact children's perceptions and interest regarding traditionally male occupations. Journal of Vocational Behavior, 82(3):208--220

  35. [35]

    Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. https://doi.org/10.1162/tacl_a_00240 Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns. Transactions of the Association for Computational Linguistics, 6:605--617

  36. [36]

    Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender Bias in Contextualized Word Embeddings. CoRR, abs/1904.03310

  37. [37]

    Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. http://arxiv.org/abs/1809.01496 Learning Gender-Neutral Word Embeddings. CoRR, abs/1809.01496