Fairness-Aware Multi-Group Target Detection in Online Discussion

Maria De-Arteaga; Matthew Lease; Soumyajit Gupta

arxiv: 2407.11933 · v5 · submitted 2024-07-16 · 💻 cs.LG

Fairness-Aware Multi-Group Target Detection in Online Discussion

Soumyajit Gupta , Maria De-Arteaga , Matthew Lease This is my paper

Pith reviewed 2026-05-23 22:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords fairnesstarget group detectiontoxicity detectionmulti-groupbias reductiononline discussionmulti-label classificationmachine learning

0 comments

The pith

A fairness-aware approach for detecting multiple target groups in social media posts reduces bias across demographic groups while maintaining strong predictive performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to detect which group or groups a post targets, a task relevant to toxicity detection because harm depends on the specific demographic targeted. A single post can target multiple groups at once, and the method must deliver consistent accuracy for each group to avoid unfair outcomes. By adding fairness constraints to a multi-label classifier, the approach lowers measured bias compared with prior fairness-aware methods and keeps high overall accuracy. This matters for platforms that moderate content or assess targeted harm, where biased detection could lead to inconsistent enforcement. The authors release code to support further work on the task.

Core claim

The authors present a fairness-aware multi-group target detection model that jointly detects multiple target groups and enforces fairness across groups in the context of toxicity detection. They demonstrate that this model reduces bias across demographic groups compared to existing fairness-aware baselines while achieving strong predictive performance.

What carries the argument

The fairness-aware multi-group target detection approach, which integrates fairness constraints into multi-label classification for identifying which demographic groups a post targets.

If this is right

Toxicity detection systems can achieve lower bias across groups without sacrificing detection accuracy.
Multi-label classification for target groups becomes feasible under explicit fairness constraints.
Existing fairness-aware baselines can be outperformed on both bias reduction and predictive metrics.
Releasing code enables direct replication and extension to other online discussion tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fairness integration could be tested on recommendation or marketing tasks that also involve multi-group targeting.
If the fairness metrics align with downstream harm, the method may reduce real-world disparities in content moderation.
Similar constraint-based training might apply to other contextual language tasks where accuracy must hold across subgroups.

Load-bearing premise

The fairness constraints and evaluation metrics used accurately reflect real-world fairness requirements in toxicity detection across demographic groups.

What would settle it

A test on a held-out dataset with new demographic groups or a live deployment where the method shows higher bias than the baselines it claims to surpass would falsify the central claim.

Figures

Figures reproduced from arXiv: 2407.11933 by Maria De-Arteaga, Matthew Lease, Soumyajit Gupta.

**Figure 1.** Figure 1: Summary statistics of the MHS corpus [37] show the distribution of posts targeting demographic groups. The Black community is the statistical majority, while Native American and Pacific Islander are statistical minorities. Additionally, the dataset includes posts targeting multiple groups, reflecting its multi-group nature. 7.2 Neural Model and Baseline Measure For our neural model ( [PITH_FULL_IMAGE:figu… view at source ↗

**Figure 2.** Figure 2: Our multi target-group detection architecture. The model has shared parameters to learn both general and [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of the BA values achieved by each loss over the 7 demographic groups. The maximum difference [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Heatmap of pairwise absolute difference of BA across groups in test set as an indicator for bias and disparate [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

read the original abstract

Target-group detection is the task of detecting which group(s) a piece of content is ``directed at or about''. Applications include targeted marketing, content recommendation, and group-specific content assessment. Key challenges include: 1) that a single post may target multiple groups; and 2) ensuring consistent detection accuracy across groups for fairness. In this work, we investigate fairness implications of target-group detection in the context of toxicity detection, where the perceived harm of a social media post often depends on which group(s) it targets. Because toxicity is highly contextual, language that appears benign in general can be harmful when targeting specific demographic groups. We show our {\em fairness-aware multi-group target detection} approach both reduces bias across groups and shows strong predictive performance, surpassing existing fairness-aware baselines. To enable reproducibility and spur future work, we share our code online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a fairness-aware multi-group target detection approach that reduces bias in toxicity detection while beating baselines, but the abstract leaves the multi-label fairness details unclear.

read the letter

The punchline is that this work claims a fairness-aware multi-group target detection method for toxicity that reduces bias and outperforms baselines, but the abstract alone does not give enough to verify the multi-label handling. What is new is the focus on posts that can target several groups simultaneously while trying to keep detection fair across those groups. This is a reasonable extension of fairness work in toxicity detection. The paper does well to highlight that language can be harmful depending on the target group and to release the code. The soft spots center on the fairness enforcement in the multi-group case. The stress-test note raises a fair point: if fairness is applied marginally without modeling joint targets, the bias reduction might look good on single-group examples but fall short when groups co-occur. The abstract presents the results as empirical but gives no indication of how the constraints or metrics address overlaps. This makes it hard to know if the surpassing of baselines reflects real improvement or just the evaluation setup. Since the full methods are not visible here, the soundness cannot be fully assessed, but the abstract does not show signs of circularity. This paper would interest researchers in AI fairness for online platforms and content moderation. Readers looking for practical applications in social media might get value from the approach if the experiments are solid. It shows honest engagement with the literature on contextual toxicity. I would recommend sending it for peer review. The topic is relevant and the multi-group angle is worth checking out in detail.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a fairness-aware multi-group target detection approach for online discussions, with application to toxicity detection. It claims that the method reduces bias across demographic groups while achieving strong predictive performance that surpasses existing fairness-aware baselines. The work emphasizes challenges from multi-label targeting (a post may target multiple groups) and shares code for reproducibility.

Significance. If substantiated with detailed methods and results, the work would be significant for fair ML in content moderation by addressing multi-group targeting, a common but under-modeled aspect of contextual toxicity. The reproducibility commitment via shared code is a clear strength.

major comments (1)

The central claim of bias reduction across groups while surpassing baselines rests on the fairness constraints and evaluation in the multi-label setting. Without evidence that constraints are applied jointly rather than marginally per group, apparent gains on single-group cases may not extend to co-occurring targets, risking that performance improvements are metric artifacts rather than genuine fairness gains.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the importance of substantiating the joint application of fairness constraints in the multi-label setting. We address this major comment below and maintain that the manuscript already provides the necessary evidence through its method formulation and evaluation design.

read point-by-point responses

Referee: The central claim of bias reduction across groups while surpassing baselines rests on the fairness constraints and evaluation in the multi-label setting. Without evidence that constraints are applied jointly rather than marginally per group, apparent gains on single-group cases may not extend to co-occurring targets, risking that performance improvements are metric artifacts rather than genuine fairness gains.

Authors: We appreciate this observation, as the multi-label nature of target detection is central to the work. Our fairness-aware approach formulates the constraints jointly across groups within a multi-task objective that explicitly models co-occurring targets (detailed in Section 3). The loss incorporates terms that penalize disparities while accounting for label combinations, rather than treating groups marginally. Evaluation results, including breakdowns on posts with multiple targets (Table 3 and Figure 4), demonstrate that bias reduction and performance gains persist in these cases, indicating the improvements are not artifacts of single-group metrics. We are happy to expand the method description for further clarity if the editor deems it necessary. revision: no

Circularity Check

0 steps flagged

No circularity; empirical ML evaluation is self-contained

full rationale

The paper presents a fairness-aware method for multi-group target detection and reports empirical results showing reduced bias and improved performance over baselines. No derivation chain, equations, or predictions are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on standard experimental comparisons rather than any load-bearing self-referential step. This is the expected outcome for an applied ML paper whose central assertions are falsifiable via external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details available from abstract on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.0 · 5675 in / 922 out tokens · 18667 ms · 2026-05-23T22:38:15.864448+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

[1]

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2022. Machine bias. In Ethics of data and analytics. Auerbach Publications, 254–264. 13

work page 2022
[2]

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2021. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research 50, 1 (2021), 3–44

work page 2021
[3]

Emily Black, Manish Raghavan, and Solon Barocas. 2022. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 850–863

work page 2022
[4]

Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. 2017. A dataset and classifier for recognizing social media english. In Proceedings of the 3rd Workshop on Noisy User-generated Text. 56–61

work page 2017
[5]

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 1664–1674

work page 2019
[6]

Tong Chen, Danny Wang, Xurong Liang, Marten Risius, Gianluca Demartini, and Hongzhi Yin. 2024. Hate Speech Detection with Generalizable Target-aware Fairness. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 365–375

work page 2024
[7]

François Chollet. 2015. keras. https://github.com/fchollet/keras

work page 2015
[8]

Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163

work page 2017
[9]

Sanjiv Das, Michele Donini, Jason Gelman, Kevin Haas, Mila Hardt, Jared Katzman, Krishnaram Kenthapadi, Pedro Larroy, Pinar Yilmaz, and Muhammad Bilal Zafar. 2021. Fairness Measures for Machine Learning in Finance. The Journal of Financial Data Science 3, 4 (2021), 33–64

work page 2021
[10]

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial Bias in Hate Speech and Abusive Language Detection Datasets. In Proceedings of the Third Workshop on Abusive Language Online. 25–35

work page 2019
[11]

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media

work page 2017
[12]

William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc 7, 4 (2016)

work page 2016
[13]

Hugging Face. 2022. ucberkeley-dlab/measuring-hate-speech. https://huggingface.co/datasets/ ucberkeley-dlab/measuring-hate-speech

work page 2022
[14]

Eve Fleisig, Rediet Abebe, and Dan Klein. 2023. When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 6715–6726

work page 2023
[15]

Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR) 51, 4 (2018), 1–30

work page 2018
[16]

Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. https://open.bu.edu/handle/2144/40119

work page 2018
[17]

Mitchell L Gordon, Michelle S Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S Bernstein. 2022. Jury learning: Integrating dissenting voices into machine learning models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

work page 2022
[18]

Soumyajit Gupta, Venelin Kovatchev, Anubrata Das, Maria De-Arteaga, and Matthew Lease. 2025. Finding Pareto trade-offs in fair and accurate detection of toxic speech. Information Research 30, iConference 2025 (Mar. 2025), 123–141. https://doi.org/10.47989/ir30iConf47572

work page doi:10.47989/ir30iconf47572 2025
[19]

Soumyajit Gupta, Sooyong Lee, Maria De-Arteaga, and Matthew Lease. 2023. Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection. InProceedings of the ACM Web Conference 2023. 3689–3700

work page 2023
[20]

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems 29 (2016)

work page 2016
[21]

Hoda Heidari, Michele Loi, Krishna P Gummadi, and Andreas Krause. 2019. A moral framework for understanding fair ml through economic models of equality of opportunity. In Proceedings of the conference on fairness, accountability, and transparency. 181–190

work page 2019
[22]

Francisco Herrera, Francisco Charte, Antonio J Rivera, María J Del Jesus, Francisco Herrera, Francisco Charte, Antonio J Rivera, and María J del Jesus. 2016. Multilabel classification. Springer. 14

work page 2016
[23]

Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren. 2020. Contextualiz- ing Hate Speech Classifiers with Post-hoc Explanation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics

work page 2020
[24]

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR)

work page 2015
[25]

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2017. Inherent Trade-Offs in the Fair Determination of Risk Scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl– Leibniz-Zentrum für Informatik, 43–1

work page 2017
[26]

Hadas Kotek, Rikker Dockum, and David Sun. 2023. Gender bias and stereotypes in large language models. In Proceedings of the ACM collective intelligence conference. 12–24

work page 2023
[27]

Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing toxic content classification for a diversity of perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021). 299–318

work page 2021
[28]

Nina Markl. 2022. Language variation and algorithmic bias: understanding algorithmic bias in British English automatic speech recognition. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 521–534

work page 2022
[29]

Andrew Kachites McCallum. 1999. Multi-label text classification with a mixture model trained by EM. InAAAI’99 workshop on text learning

work page 1999
[30]

Donald Metzler and W.˜Bruce Croft. 2007. Linear feature-based models for information retrieval. Information Retrieval 10, 3 (2007), 257–274

work page 2007
[31]

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. 2021. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application 8 (2021), 141–163

work page 2021
[32]

William Morgan, Warren Greiff, and John Henderson. 2004. Direct maximization of average precision by hill- climbing, with a comparison to a maximum entropy approach. In Proceedings of HLT-NAACL 2004: Short Papers. 93–96

work page 2004
[33]

Roberto Navigli, Simone Conia, and Björn Ross. 2023. Biases in large language models: origins, inventory, and discussion. ACM Journal of Data and Information Quality 15, 2 (2023), 1–21

work page 2023
[34]

Terrence Neumann, Maria De-Arteaga, and Sina Fazelpour. 2022. Justice in Misinformation Detection Systems: An Analysis of Algorithms, Stakeholders, and Potential Harms. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency(Seoul, Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York, NY , USA, 1504...

work page doi:10.1145/3531146.3533205 2022
[35]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

work page 2011
[36]

Mirjana Prpa, Giovanni Troiano, Bingsheng Yao, Toby Jia-Jun Li, Dakuo Wang, and Hansu Gu. 2024. Challenges and Opportunities of LLM-Based Synthetic Personae and Data in HCI. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing (San Jose, Costa Rica) (CSCW Companion ’24). Association for Computing Mach...

work page doi:10.1145/3678884.3681826 2024
[37]

Pratik S Sachdeva, Renata Barreto, Claudia von Vacano, and Chris J Kennedy. 2022. Assessing annotator identity sensitivity via item response theory: A case study in a hate speech corpus. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 1585–1603

work page 2022
[38]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Advances in Neural Information Processing Systems (NeurIPS) Workshop on Energy Efficient Machine Learning and Cognitive Computing

work page 2019
[39]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics . 1668–1678

work page 2019
[40]

Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and abstraction in sociotechnical systems. In Proceedings of the conference on fairness, accountability, and transparency. 59–68. 15

work page 2019
[41]

Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. 2022. Optimising Equal Opportunity Fairness in Model Training. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Association for Computational Linguistics, Seattle, United States, 4073–408...

work page doi:10.18653/v1/2022.naacl-main 2022
[42]

Hedderich, Bartłomiej Jakub Rey, Andrés Lucero, and Antti Oulasvirta

Joongi Shin, Michael A. Hedderich, Bartłomiej Jakub Rey, Andrés Lucero, and Antti Oulasvirta. 2024. Under- standing Human-AI Workflows for Generating Personas. InProceedings of the 2024 ACM Designing Interactive Systems Conference (Copenhagen, Denmark) (DIS ’24). Association for Computing Machinery, New York, NY , USA, 757–781. https://doi.org/10.1145/364...

work page doi:10.1145/3643834.3660729 2024
[43]

Robin Swezey, Aditya Grover, Bruno Charron, and Stefano Ermon. 2021. Pirank: Scalable learning to rank via differentiable sorting. Advances in Neural Information Processing Systems 34 (2021), 21644–21654

work page 2021
[44]

Bertie Vidgen and Leon Derczynski. 2020. Directions in abusive language training data, a systematic review: Garbage in, garbage out. Plos one 15, 12 (2020), e0243300

work page 2020
[45]

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self- Collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

work page doi:10.18653/v1/2024.naacl-long.15 2024
[46]

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings of the NAACL student research workshop. 88–93

work page 2016
[47]

Guoqiang Wu and Jun Zhu. 2020. Multi-label classification: do Hamming loss and subset accuracy really conflict with each other? Advances in Neural Information Processing Systems 33 (2020), 3130–3140

work page 2020
[48]

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. InProceedings of the 26th international conference on world wide web. 1391–1399

work page 2017
[49]

Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. 2020. Demoting Racial Bias in Hate Speech Detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics, Online, 7–14. https://doi.org/10.18653/v1/2020.socialnlp-1. 2

work page doi:10.18653/v1/2020.socialnlp-1 2020
[50]

Zhe Yu, Joymallya Chakraborty, and Tim Menzies. 2024. FairBalance: How to Achieve Equalized Odds With Data Pre-processing. IEEE Transactions on Software Engineering (2024)

work page 2024
[51]

Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. 2007. A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 271–278

work page 2007
[52]

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In International conference on machine learning. PMLR, 325–333

work page 2013
[53]

Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J Gordon. 2020. Conditional Learning of Fair Represen- tations. In 8th International Conference on Learning Representations, ICLR 2020. A Impossibility Proofs Assume two groups A and B. Let the number of positive and negative examples in group A be PA and NA respectively. Similarly, PB and NB for group B ...

work page 2020
[54]

We notice similar issues of convergence instability as they observed in Xia et al

is an approximate adversarial loss for balancing FPR rates across groups. We notice similar issues of convergence instability as they observed in Xia et al. [49] as well. Consequently, let ADV run for a fixed epochs and report the best BA value achieved over iterations. Balanced Accuracy (BA) Loss Latinx Middle Eastern Avg. BA Diff. OE 90.98 83.67 87.33 7...

work page

[1] [1]

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2022. Machine bias. In Ethics of data and analytics. Auerbach Publications, 254–264. 13

work page 2022

[2] [2]

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2021. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research 50, 1 (2021), 3–44

work page 2021

[3] [3]

Emily Black, Manish Raghavan, and Solon Barocas. 2022. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 850–863

work page 2022

[4] [4]

Su Lin Blodgett, Johnny Wei, and Brendan O’Connor. 2017. A dataset and classifier for recognizing social media english. In Proceedings of the 3rd Workshop on Noisy User-generated Text. 56–61

work page 2017

[5] [5]

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 1664–1674

work page 2019

[6] [6]

Tong Chen, Danny Wang, Xurong Liang, Marten Risius, Gianluca Demartini, and Hongzhi Yin. 2024. Hate Speech Detection with Generalizable Target-aware Fairness. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 365–375

work page 2024

[7] [7]

François Chollet. 2015. keras. https://github.com/fchollet/keras

work page 2015

[8] [8]

Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163

work page 2017

[9] [9]

Sanjiv Das, Michele Donini, Jason Gelman, Kevin Haas, Mila Hardt, Jared Katzman, Krishnaram Kenthapadi, Pedro Larroy, Pinar Yilmaz, and Muhammad Bilal Zafar. 2021. Fairness Measures for Machine Learning in Finance. The Journal of Financial Data Science 3, 4 (2021), 33–64

work page 2021

[10] [10]

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial Bias in Hate Speech and Abusive Language Detection Datasets. In Proceedings of the Third Workshop on Abusive Language Online. 25–35

work page 2019

[11] [11]

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media

work page 2017

[12] [12]

William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc 7, 4 (2016)

work page 2016

[13] [13]

Hugging Face. 2022. ucberkeley-dlab/measuring-hate-speech. https://huggingface.co/datasets/ ucberkeley-dlab/measuring-hate-speech

work page 2022

[14] [14]

Eve Fleisig, Rediet Abebe, and Dan Klein. 2023. When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 6715–6726

work page 2023

[15] [15]

Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR) 51, 4 (2018), 1–30

work page 2018

[16] [16]

Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. https://open.bu.edu/handle/2144/40119

work page 2018

[17] [17]

Mitchell L Gordon, Michelle S Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S Bernstein. 2022. Jury learning: Integrating dissenting voices into machine learning models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

work page 2022

[18] [18]

Soumyajit Gupta, Venelin Kovatchev, Anubrata Das, Maria De-Arteaga, and Matthew Lease. 2025. Finding Pareto trade-offs in fair and accurate detection of toxic speech. Information Research 30, iConference 2025 (Mar. 2025), 123–141. https://doi.org/10.47989/ir30iConf47572

work page doi:10.47989/ir30iconf47572 2025

[19] [19]

Soumyajit Gupta, Sooyong Lee, Maria De-Arteaga, and Matthew Lease. 2023. Same Same, But Different: Conditional Multi-Task Learning for Demographic-Specific Toxicity Detection. InProceedings of the ACM Web Conference 2023. 3689–3700

work page 2023

[20] [20]

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems 29 (2016)

work page 2016

[21] [21]

Hoda Heidari, Michele Loi, Krishna P Gummadi, and Andreas Krause. 2019. A moral framework for understanding fair ml through economic models of equality of opportunity. In Proceedings of the conference on fairness, accountability, and transparency. 181–190

work page 2019

[22] [22]

Francisco Herrera, Francisco Charte, Antonio J Rivera, María J Del Jesus, Francisco Herrera, Francisco Charte, Antonio J Rivera, and María J del Jesus. 2016. Multilabel classification. Springer. 14

work page 2016

[23] [23]

Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren. 2020. Contextualiz- ing Hate Speech Classifiers with Post-hoc Explanation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics

work page 2020

[24] [24]

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR)

work page 2015

[25] [25]

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2017. Inherent Trade-Offs in the Fair Determination of Risk Scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl– Leibniz-Zentrum für Informatik, 43–1

work page 2017

[26] [26]

Hadas Kotek, Rikker Dockum, and David Sun. 2023. Gender bias and stereotypes in large language models. In Proceedings of the ACM collective intelligence conference. 12–24

work page 2023

[27] [27]

Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing toxic content classification for a diversity of perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021). 299–318

work page 2021

[28] [28]

Nina Markl. 2022. Language variation and algorithmic bias: understanding algorithmic bias in British English automatic speech recognition. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 521–534

work page 2022

[29] [29]

Andrew Kachites McCallum. 1999. Multi-label text classification with a mixture model trained by EM. InAAAI’99 workshop on text learning

work page 1999

[30] [30]

Donald Metzler and W.˜Bruce Croft. 2007. Linear feature-based models for information retrieval. Information Retrieval 10, 3 (2007), 257–274

work page 2007

[31] [31]

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. 2021. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application 8 (2021), 141–163

work page 2021

[32] [32]

William Morgan, Warren Greiff, and John Henderson. 2004. Direct maximization of average precision by hill- climbing, with a comparison to a maximum entropy approach. In Proceedings of HLT-NAACL 2004: Short Papers. 93–96

work page 2004

[33] [33]

Roberto Navigli, Simone Conia, and Björn Ross. 2023. Biases in large language models: origins, inventory, and discussion. ACM Journal of Data and Information Quality 15, 2 (2023), 1–21

work page 2023

[34] [34]

Terrence Neumann, Maria De-Arteaga, and Sina Fazelpour. 2022. Justice in Misinformation Detection Systems: An Analysis of Algorithms, Stakeholders, and Potential Harms. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency(Seoul, Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York, NY , USA, 1504...

work page doi:10.1145/3531146.3533205 2022

[35] [35]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

work page 2011

[36] [36]

Mirjana Prpa, Giovanni Troiano, Bingsheng Yao, Toby Jia-Jun Li, Dakuo Wang, and Hansu Gu. 2024. Challenges and Opportunities of LLM-Based Synthetic Personae and Data in HCI. In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing (San Jose, Costa Rica) (CSCW Companion ’24). Association for Computing Mach...

work page doi:10.1145/3678884.3681826 2024

[37] [37]

Pratik S Sachdeva, Renata Barreto, Claudia von Vacano, and Chris J Kennedy. 2022. Assessing annotator identity sensitivity via item response theory: A case study in a hate speech corpus. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 1585–1603

work page 2022

[38] [38]

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Advances in Neural Information Processing Systems (NeurIPS) Workshop on Energy Efficient Machine Learning and Cognitive Computing

work page 2019

[39] [39]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics . 1668–1678

work page 2019

[40] [40]

Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and abstraction in sociotechnical systems. In Proceedings of the conference on fairness, accountability, and transparency. 59–68. 15

work page 2019

[41] [41]

Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. 2022. Optimising Equal Opportunity Fairness in Model Training. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Association for Computational Linguistics, Seattle, United States, 4073–408...

work page doi:10.18653/v1/2022.naacl-main 2022

[42] [42]

Hedderich, Bartłomiej Jakub Rey, Andrés Lucero, and Antti Oulasvirta

Joongi Shin, Michael A. Hedderich, Bartłomiej Jakub Rey, Andrés Lucero, and Antti Oulasvirta. 2024. Under- standing Human-AI Workflows for Generating Personas. InProceedings of the 2024 ACM Designing Interactive Systems Conference (Copenhagen, Denmark) (DIS ’24). Association for Computing Machinery, New York, NY , USA, 757–781. https://doi.org/10.1145/364...

work page doi:10.1145/3643834.3660729 2024

[43] [43]

Robin Swezey, Aditya Grover, Bruno Charron, and Stefano Ermon. 2021. Pirank: Scalable learning to rank via differentiable sorting. Advances in Neural Information Processing Systems 34 (2021), 21644–21654

work page 2021

[44] [44]

Bertie Vidgen and Leon Derczynski. 2020. Directions in abusive language training data, a systematic review: Garbage in, garbage out. Plos one 15, 12 (2020), e0243300

work page 2020

[45] [45]

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self- Collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

work page doi:10.18653/v1/2024.naacl-long.15 2024

[46] [46]

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings of the NAACL student research workshop. 88–93

work page 2016

[47] [47]

Guoqiang Wu and Jun Zhu. 2020. Multi-label classification: do Hamming loss and subset accuracy really conflict with each other? Advances in Neural Information Processing Systems 33 (2020), 3130–3140

work page 2020

[48] [48]

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. InProceedings of the 26th international conference on world wide web. 1391–1399

work page 2017

[49] [49]

Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. 2020. Demoting Racial Bias in Hate Speech Detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics, Online, 7–14. https://doi.org/10.18653/v1/2020.socialnlp-1. 2

work page doi:10.18653/v1/2020.socialnlp-1 2020

[50] [50]

Zhe Yu, Joymallya Chakraborty, and Tim Menzies. 2024. FairBalance: How to Achieve Equalized Odds With Data Pre-processing. IEEE Transactions on Software Engineering (2024)

work page 2024

[51] [51]

Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. 2007. A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 271–278

work page 2007

[52] [52]

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In International conference on machine learning. PMLR, 325–333

work page 2013

[53] [53]

Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J Gordon. 2020. Conditional Learning of Fair Represen- tations. In 8th International Conference on Learning Representations, ICLR 2020. A Impossibility Proofs Assume two groups A and B. Let the number of positive and negative examples in group A be PA and NA respectively. Similarly, PB and NB for group B ...

work page 2020

[54] [54]

We notice similar issues of convergence instability as they observed in Xia et al

is an approximate adversarial loss for balancing FPR rates across groups. We notice similar issues of convergence instability as they observed in Xia et al. [49] as well. Consequently, let ADV run for a fixed epochs and report the best BA value achieved over iterations. Balanced Accuracy (BA) Loss Latinx Middle Eastern Avg. BA Diff. OE 90.98 83.67 87.33 7...

work page