Hidden in Plain Sight For Too Long: Using Text Mining Techniques to Shine a Light on Workplace Sexism and Sexual Harassment

Amir Karami; Cynthia Nicole White; Kayla Ford; Suzanne C. Swan

arxiv: 1907.00510 · v1 · pith:ZKM5W524new · submitted 2019-07-01 · 💻 cs.CY · cs.CL · stat.AP

Pith reviewed 2026-05-25 12:02 UTC · model grok-4.3

classification 💻 cs.CY · cs.CL · stat.AP

keywords text mining · workplace sexism · sexual harassment · topic extraction · sex discrimination · gender harassment · unwanted sexual attention

The pith

Text mining of 2362 online workplace reports extracts 23 topics grouped into three literature-derived themes of sexism and harassment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper applies a computational text mining framework to a large set of personal accounts of workplace experiences to identify recurring patterns of sexism and sexual harassment. The authors extract 23 topics from the data and code them into three overarching themes drawn from prior research on sex discrimination and sexual harassment. The first theme covers unfavorable treatment based on sex such as denied promotions or lower pay. The second combines sex discrimination with gender harassment including insults and bullying tied to stereotypes. The third focuses on unwanted sexual attention such as degrading comments and physical contact, with touching emerging as the most frequent specific experience. This method shows how automatic analysis of naturally occurring large datasets can extend understanding beyond what small-scale traditional studies typically achieve.

Core claim

The paper establishes that text mining applied to 2362 posted workplace experiences generates 23 topics that researchers can code and group into three themes from the sex discrimination and sexual harassment literature. Sex Discrimination includes being passed over for promotion, denied opportunities, paid less, or ignored in meetings. Sex Discrimination and Gender Harassment covers sexist hostility such as insults, jokes invoking misogynistic stereotypes, and bullying. Unwanted Sexual Attention describes sexual comments and behaviors used to degrade women; the unwanted-touching topic carries the highest weight, indicating how common that experience is among the posted accounts.

What carries the argument

A computational text mining framework that collects posted experiences and extracts topics for subsequent qualitative coding and mapping onto three literature themes.

If this is right

Automatic processes enable investigation of naturally occurring large-scale internet datasets on workplace sexism beyond the limits of traditional methods.
Unwanted touching is shown as the highest-weighted topic among reported experiences of sexual attention.
Sex discrimination appears in concrete forms such as promotion denials, pay disparities, and being talked over in meetings.
Gender harassment includes behaviors ranging from misogynistic jokes and insults to bullying.
The three themes provide a structured way to categorize everyday experiences drawn from real posted accounts.
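
The idea of a "highest-weighted topic" can be made concrete with a small sketch. The abstract does not say how topic weights were computed; the convention assumed here is to sum each topic's per-document proportions and normalise by the total. The matrix values, the topic names, and the topic-to-theme mapping are all invented for illustration and are not the paper's data.

```python
from collections import defaultdict

# Hypothetical document-topic proportions (each row sums to 1).
# Rows are reports, columns are topics; values are made up for illustration.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
]
topic_names = ["pay_and_promotion", "sexist_jokes", "unwanted_touching"]

# Hypothetical mapping from topics to the three literature-derived themes.
topic_to_theme = {
    "pay_and_promotion": "Sex Discrimination",
    "sexist_jokes": "Sex Discrimination and Gender Harassment",
    "unwanted_touching": "Unwanted Sexual Attention",
}


def topic_weights(matrix, names):
    """Return each topic's share of the total probability mass in the corpus."""
    totals = [sum(row[j] for row in matrix) for j in range(len(names))]
    grand_total = sum(totals)
    return {name: total / grand_total for name, total in zip(names, totals)}


def theme_weights(weights, mapping):
    """Aggregate topic weights into theme weights."""
    themes = defaultdict(float)
    for topic, weight in weights.items():
        themes[mapping[topic]] += weight
    return dict(themes)


weights = topic_weights(doc_topic, topic_names)
top_topic = max(weights, key=weights.get)
themes = theme_weights(weights, topic_to_theme)
```

Under this assumed definition a topic's weight reflects how much of the corpus it accounts for, so a ranking of weights is only as trustworthy as the underlying topic assignments.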

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same text mining approach could be repeated on updated collections of reports to track whether the frequency of specific topics shifts over time.
High-weight topics such as unwanted touching could guide the design of targeted workplace training or reporting mechanisms.
If applied to reports from multiple online sources the method might reveal whether theme distributions differ by industry or location.
Alignment between extracted topics and established literature themes could support using online data as a complement to survey-based studies of harassment prevalence.

Load-bearing premise

The topics generated by the text mining method accurately capture the content of the experiences and can be validly grouped by researchers into the three literature themes without substantial bias or omission.

What would settle it

Independent researchers manually coding a representative sample of the same reports and obtaining substantially different topics or theme groupings would indicate the mapping does not hold.

Original abstract

Objective: The goal of this study is to understand how people experience sexism and sexual harassment in the workplace by discovering themes in 2,362 experiences posted on the Everyday Sexism Project's website everydaysexism.com. Method: This study used both quantitative and qualitative methods. The quantitative method was a computational framework to collect and analyze a large number of workplace sexual harassment experiences. The qualitative method was the analysis of the topics generated by a text mining method. Results: Twenty-three topics were coded and then grouped into three overarching themes from the sex discrimination and sexual harassment literature. The Sex Discrimination theme included experiences in which women were treated unfavorably due to their sex, such as being passed over for promotion, denied opportunities, paid less than men, and ignored or talked over in meetings. The Sex Discrimination and Gender harassment theme included stories about sex discrimination and gender harassment, such as sexist hostility behaviors ranging from insults and jokes invoking misogynistic stereotypes to bullying behavior. The last theme, Unwanted Sexual Attention, contained stories describing sexual comments and behaviors used to degrade women. Unwanted touching was the highest weighted topic, indicating how common it was for website users to endure being touched, hugged or kissed, groped, and grabbed. Conclusions: This study illustrates how researchers can use automatic processes to go beyond the limits of traditional research methods and investigate naturally occurring large scale datasets on the internet to achieve a better understanding of everyday workplace sexism experiences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper collects 2,362 workplace sexism and sexual harassment experiences posted on the Everyday Sexism Project website and applies a computational text mining framework to extract 23 topics. These topics are then qualitatively coded by the authors and grouped into three overarching themes drawn from the existing sex discrimination and sexual harassment literature: Sex Discrimination (e.g., being passed over for promotion or paid less), Sex Discrimination and Gender Harassment (e.g., insults and bullying invoking misogynistic stereotypes), and Unwanted Sexual Attention (e.g., sexual comments and touching, with unwanted touching as the highest-weighted topic). The study concludes that such automatic processes enable analysis of large-scale naturally occurring online datasets beyond the limits of traditional research methods.

Significance. If the topic model produces coherent groupings and the qualitative mapping is reproducible, the work illustrates a practical mixed-methods pipeline for scaling qualitative insight into sensitive workplace experiences using public web data. It supplies concrete illustrations of theme prevalence (e.g., unwanted touching) that could inform future survey design or policy.

major comments (3)

[Method] Method section: the description of the 'computational framework' provides no information on the specific text-mining algorithm, preprocessing pipeline, how the number of topics was chosen, or any model diagnostics (coherence, topic-word distributions, or stability checks). Without these, it is impossible to assess whether the 23 topics are data-driven or artifacts of parameter choices.
[Results] Results section: the qualitative coding step that maps the 23 topics onto the three literature-derived themes reports no inter-coder reliability statistics, coding protocol, or safeguards against confirmation bias. Because the themes originate in prior literature and are applied post-hoc, the absence of these details undermines the claim that the topics 'accurately capture' the posted experiences.
[Conclusions] The central claim that the method 'achieve[s] a better understanding of everyday workplace sexism experiences' rests on the validity of both the unsupervised topic extraction and the subsequent human mapping; the manuscript supplies no quantitative or qualitative evidence that either step succeeded.

minor comments (1)

[Abstract] The abstract states 'Twenty-three topics were coded' but does not clarify whether this refers to the raw output of the topic model or to a post-processed selection; a brief clarification would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these detailed and constructive comments on the methodological transparency of our work. We address each point below and commit to revisions that strengthen the manuscript without altering its core findings or claims.

Point-by-point responses

Referee: [Method] Method section: the description of the 'computational framework' provides no information on the specific text-mining algorithm, preprocessing pipeline, how the number of topics was chosen, or any model diagnostics (coherence, topic-word distributions, or stability checks). Without these, it is impossible to assess whether the 23 topics are data-driven or artifacts of parameter choices.

Authors: We agree that the current Method section is insufficiently detailed. In the revised manuscript we will expand it to specify the topic modeling algorithm (LDA), the full preprocessing pipeline (including tokenization, lemmatization, and stop-word removal), the procedure used to select 23 topics (coherence-based model selection across a range of k values), and report standard diagnostics such as topic coherence scores, top word distributions per topic, and any stability checks performed. revision: yes
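
Two of the steps named in this response, preprocessing and coherence scoring, can be sketched with the standard library alone. This is a minimal illustration, not the authors' pipeline: the stop-word list is a tiny hypothetical one, lemmatization is omitted because it needs an external library, and UMass coherence is assumed as the metric since the response does not say which coherence measure is meant. The example sentences are invented and do not come from the dataset.

```python
import math
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "he", "i", "in", "me", "my", "of",
              "the", "to", "was", "were"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over word pairs.

    D(w) counts documents containing w; D(wi, wj) counts documents
    containing both. Higher scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                raise ValueError(f"{top_words[j]!r} does not occur in the corpus")
            score += math.log((doc_freq(top_words[i], top_words[j]) + 1) / denom)
    return score


# Made-up example sentences, not taken from the dataset.
reports = [
    "My manager was paid more and promoted ahead of me.",
    "He grabbed my arm at the office party.",
    "I was passed over and paid less in the same role.",
    "A colleague grabbed and touched me in the lift.",
]
docs = [preprocess(r) for r in reports]

coherent = umass_coherence(["paid", "passed"], docs)
incoherent = umass_coherence(["paid", "grabbed"], docs)
```

In a model-selection loop one would fit a topic model for each candidate k, average this score over the topics' top words, and compare across k; the fitting step is left out here because it depends on an external library.
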
Referee: [Results] Results section: the qualitative coding step that maps the 23 topics onto the three literature-derived themes reports no inter-coder reliability statistics, coding protocol, or safeguards against confirmation bias. Because the themes originate in prior literature and are applied post-hoc, the absence of these details undermines the claim that the topics 'accurately capture' the posted experiences.

Authors: We accept this criticism. The revised Results section will include a dedicated subsection describing the coding protocol, the independent coding process used by the authors, how disagreements were resolved, inter-coder reliability statistics (e.g., Cohen’s kappa), and explicit steps taken to reduce confirmation bias such as initial open coding before mapping to the three literature-derived themes. revision: yes
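
The inter-coder statistic mentioned in this response can be illustrated briefly. Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance from each coder's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below assumes two coders who each assign one theme per topic; the label lists are hypothetical and do not reflect any coding the authors performed.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical theme assignments for ten topics by two coders.
# SD = Sex Discrimination, GH = Sex Discrimination and Gender Harassment,
# USA = Unwanted Sexual Attention. The labels are invented for illustration.
coder_1 = ["SD", "SD", "SD", "GH", "GH", "GH", "USA", "USA", "USA", "USA"]
coder_2 = ["SD", "SD", "GH", "GH", "GH", "USA", "USA", "USA", "USA", "USA"]

kappa = cohens_kappa(coder_1, coder_2)
```

Here the coders agree on 8 of 10 topics, but kappa is lower than 0.8 because some of that agreement would be expected by chance given how often each coder uses each theme.
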
Referee: [Conclusions] The central claim that the method 'achieve[s] a better understanding of everyday workplace sexism experiences' rests on the validity of both the unsupervised topic extraction and the subsequent human mapping; the manuscript supplies no quantitative or qualitative evidence that either step succeeded.

Authors: The manuscript does present the 23 topics with illustrative excerpts and their grouping into the three themes, which constitutes qualitative face-validity evidence. Nevertheless, we agree that stronger validation is required. In revision we will add the quantitative diagnostics noted above, additional qualitative examples demonstrating topic coherence, and will moderate the Conclusions language to emphasize the exploratory and illustrative nature of the pipeline rather than claiming definitive superiority over traditional methods. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper collects 2362 posts, applies text mining to surface 23 topics, and then qualitatively codes and maps those topics onto three pre-existing themes drawn from the sex discrimination and sexual harassment literature. This is a standard two-stage pipeline (unsupervised topic extraction followed by human interpretive coding against external literature categories). No equations, fitted parameters, self-citations, or uniqueness claims appear in the provided text that would reduce the reported themes or topic groupings to the inputs by construction. The central result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The analysis rests on the assumption that unsupervised topic modeling will surface themes that align with established categories in the sexism literature, plus the choice of topic count and manual coding rules.

free parameters (1)

number_of_topics = 23
Set to 23, with the resulting topics then grouped into themes; no justification or sensitivity analysis is visible in the abstract.

axioms (1)

domain assumption: Topics produced by the text mining method correspond to meaningful real-world experiences of sexism that can be coded into literature categories.
Invoked when results are grouped into the three themes.
