pith. machine review for the scientific record.

arxiv: 2110.08193 · v2 · submitted 2021-10-15 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

BBQ: A Hand-Built Bias Benchmark for Question Answering

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 19:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords: bias benchmark · question answering · social stereotypes · NLP fairness · protected classes · model evaluation · gender bias

The pith

Question answering models rely on social stereotypes, showing higher accuracy when correct answers align with biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Bias Benchmark for QA (BBQ), a dataset of hand-constructed question sets targeting attested social biases across nine dimensions in U.S. English-speaking contexts. It evaluates models in under-informative contexts to measure reliance on stereotypes and in adequately informative contexts to check whether biases override the correct answer. Models reproduce harmful stereotypes in ambiguous settings and average up to 3.4 percentage points higher accuracy when the correct answer matches a bias, with the gap widening to over 5 points on gender-targeted examples. This shows how learned biases surface in applied NLP tasks like QA.

Core claim

Models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.
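
To make the measured quantity concrete, the sketch below shows one way the bias-aligned versus bias-conflicting accuracy gap could be computed from per-example evaluation records. The record fields (correct, aligned, category) are hypothetical names chosen for illustration, not the paper's released data format.

```python
from collections import defaultdict

def accuracy_gaps(records):
    """Accuracy gap (bias-aligned minus bias-conflicting), in percentage points.

    Each record is a dict with:
      correct  (bool): the model picked the gold answer
      aligned  (bool): the gold answer matches the attested stereotype
      category (str) : social dimension, e.g. "Gender_identity"
    Returns a dict of per-category gaps plus an "_overall" entry.
    """
    buckets = defaultdict(lambda: {"aligned": [], "conflicting": []})
    for r in records:
        key = "aligned" if r["aligned"] else "conflicting"
        buckets[r["category"]][key].append(r["correct"])
        buckets["_overall"][key].append(r["correct"])

    def acc(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    # A positive value means the model is more accurate when the correct
    # answer matches the stereotype; ~3.4 overall would match the abstract.
    return {
        cat: 100.0 * (acc(b["aligned"]) - acc(b["conflicting"]))
        for cat, b in buckets.items()
    }
```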

What carries the argument

The BBQ dataset of question sets that highlight attested social biases against protected classes along nine social dimensions.
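
As a sketch of the unit this dataset is built from, the hypothetical structure below models one BBQ-style question set: an ambiguous context, a disambiguating continuation, a question, answer options including an unknown-type choice, and a bias-dimension label. The field names and schema are illustrative assumptions, not the released format.

```python
from dataclasses import dataclass

@dataclass
class BBQItem:
    """One hypothetical question set in the style the paper describes."""
    category: str                # one of the nine social dimensions
    ambiguous_context: str       # the under-informative setting shows only this
    disambiguating_context: str  # appended in the adequately informative setting
    question: str
    answers: list                # e.g. two individuals plus an unknown-type option
    unknown_index: int           # position of the unknown-type option
    gold_index: int              # correct answer once the context is informative
    stereotyped_index: int       # answer an attested stereotype would favor
```

Under this framing, an unbiased model should select the unknown-type option when only the ambiguous context is shown, and the gold answer once the disambiguating context is added.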

If this is right

  • Models produce biased outputs in under-informative contexts by defaulting to stereotypes.
  • Accuracy is consistently higher when the correct answer aligns with a social bias.
  • The bias effect is stronger on gender-targeted examples, exceeding 5 percentage points for most models.
  • BBQ can be used to evaluate and track bias in QA models over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of QA systems should test for bias in ambiguous scenarios common in real use.
  • The benchmark could help identify which dimensions cause the most bias for specific models.
  • Extending BBQ to other languages might reveal cultural differences in model biases.

Load-bearing premise

The hand-constructed questions accurately capture attested real-world social biases in U.S. English contexts without introducing artificial patterns that models exploit differently from natural text.

What would settle it

A test showing no difference in model accuracy between bias-aligned and bias-conflicting correct answers in the informative context setting.
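
One illustrative way to run such a test is a paired sign-flip permutation test of the null hypothesis that correctness on bias-aligned and bias-conflicting versions of matched items is exchangeable. The sketch below assumes equal-length 0/1 correctness lists for matched pairs; it is not an analysis taken from the paper.

```python
import random

def paired_permutation_test(aligned, conflicting, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test for the paired accuracy gap.

    aligned, conflicting: equal-length 0/1 lists, one entry per matched item
    (bias-aligned vs. bias-conflicting version). Returns (gap in percentage
    points, p-value) under the null that the two labels are exchangeable.
    """
    rng = random.Random(seed)
    n = len(aligned)
    observed = 100.0 * (sum(aligned) - sum(conflicting)) / n

    extreme = 0
    for _ in range(n_resamples):
        diff = 0
        for a, c in zip(aligned, conflicting):
            if rng.random() < 0.5:  # randomly swap the pair's labels
                a, c = c, a
            diff += a - c
        if abs(100.0 * diff / n) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_resamples + 1)
```

A gap that survives this test in the informative-context setting would support the paper's claim; a null result would settle the question in the other direction.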

read the original abstract

It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Bias Benchmark for QA (BBQ), a hand-constructed dataset of question sets targeting attested social biases across nine dimensions in U.S. English contexts. It evaluates QA models in under-informative contexts (to measure stereotype reliance in outputs) and adequately informative contexts (to test whether biases override correct answers), finding that models reproduce harmful stereotypes in ambiguous settings and show an average 3.4 percentage point accuracy advantage (widening to over 5 points for gender) when the correct answer aligns with a social bias.

Significance. If the empirical measurements hold, BBQ offers a valuable applied benchmark for how social biases manifest in QA outputs rather than isolated representations, directly addressing a gap in task-specific bias evaluation. The reported accuracy gaps provide falsifiable, quantitative evidence of persistent stereotype effects even with informative context, which could guide targeted mitigation work.

major comments (3)
  1. [Dataset construction] Dataset construction section: the manuscript provides no inter-annotator agreement statistics or validation procedure for the hand-authored contexts, questions, and answer choices. Without these, the reliability of the bias dimension labels and the claim that items capture attested rather than author-specific patterns cannot be assessed.
  2. [Results] Results section (discussion of accuracy gaps): no statistical testing, confidence intervals, or significance assessment is reported for the 3.4 pp overall gap or the >5 pp gender gap. Given the modest size of the effect, this omission weakens the central claim that models 'still rely on stereotypes' in informative contexts.
  3. [Evaluation] Evaluation and analysis sections: the paper does not include any comparison or external validation showing that the hand-built items match the distribution or linguistic properties of naturally occurring biased QA instances in U.S. English. This leaves open the possibility that measured effects partly reflect construction artifacts rather than real-world social biases.
minor comments (2)
  1. [Abstract] Abstract: specify the total number of models evaluated and list the exact nine social dimensions for clarity.
  2. [Figures/Tables] Figure and table captions: ensure all bias dimensions and model names are fully spelled out rather than abbreviated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript provides no inter-annotator agreement statistics or validation procedure for the hand-authored contexts, questions, and answer choices. Without these, the reliability of the bias dimension labels and the claim that items capture attested rather than author-specific patterns cannot be assessed.

    Authors: The BBQ dataset was hand-constructed by the authors based on attested biases documented in the social science literature rather than crowd-sourced annotations, so standard inter-annotator agreement statistics do not apply. We will revise the dataset construction section to provide a more detailed description of the item creation process, the specific references used to ground each bias dimension, and the internal consistency checks performed to ensure items target documented social patterns rather than author-specific ones. revision: yes

  2. Referee: [Results] Results section (discussion of accuracy gaps): no statistical testing, confidence intervals, or significance assessment is reported for the 3.4 pp overall gap or the >5 pp gender gap. Given the modest size of the effect, this omission weakens the central claim that models 'still rely on stereotypes' in informative contexts.

    Authors: We agree that statistical assessment would strengthen the results. In the revised manuscript we will add bootstrap-derived 95% confidence intervals for all reported accuracy differences (including the overall 3.4 pp gap and the gender gap exceeding 5 pp) along with p-values from paired significance tests to evaluate whether the observed differences are statistically reliable (an illustrative sketch of such an interval follows these responses). revision: yes

  3. Referee: [Evaluation] Evaluation and analysis sections: the paper does not include any comparison or external validation showing that the hand-built items match the distribution or linguistic properties of naturally occurring biased QA instances in U.S. English. This leaves open the possibility that measured effects partly reflect construction artifacts rather than real-world social biases.

    Authors: We acknowledge the value of such validation but note that BBQ is deliberately constructed as a controlled, targeted benchmark to isolate specific bias effects rather than to replicate the distribution of naturally occurring QA data. We will add an explicit limitations paragraph discussing this design choice and the trade-off between controlled construction and ecological validity, while retaining the grounding in attested biases from prior literature. revision: partial
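
As a companion to the second response above, here is a minimal sketch of a percentile-bootstrap 95% confidence interval for the accuracy gap. The resampling unit (per-example records resampled independently within each group) is an assumption made for illustration; the authors' actual analysis might resample templates or use a different scheme.

```python
import random

def bootstrap_gap_ci(aligned, conflicting, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the aligned-minus-conflicting accuracy gap.

    aligned, conflicting: 0/1 correctness lists for the two groups.
    Returns (lo, hi) in percentage points.
    """
    rng = random.Random(seed)

    def gap(a, c):
        return 100.0 * (sum(a) / len(a) - sum(c) / len(c))

    gaps = sorted(
        gap([rng.choice(aligned) for _ in aligned],
            [rng.choice(conflicting) for _ in conflicting])
        for _ in range(n_boot)
    )
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```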

Circularity Check

0 steps flagged

No circularity: empirical measurements on newly constructed benchmark dataset

full rationale

The paper introduces the BBQ dataset via hand-authored question sets targeting attested U.S. English social biases across nine dimensions. All reported results consist of direct accuracy and stereotype-rate measurements on held-out model evaluations in under-informative versus informative contexts. No equations, fitted parameters, or derived predictions appear; the central claims are statistical aggregates over the authors' own test items rather than reductions of any quantity to prior fitted values or self-citations. The evaluation chain is therefore self-contained empirical testing and does not reduce to its inputs by construction.
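
The rationale refers to stereotype-rate measurements alongside accuracy; one simple illustration of such a rate for the ambiguous setting is the share of substantive (non-unknown) answers that pick the stereotyped individual. The sketch below uses hypothetical field names and is not necessarily the exact bias score the paper defines.

```python
def stereotype_rate(predictions):
    """Share of substantive (non-unknown) answers that follow the stereotype.

    predictions: records for ambiguous-context items, each a dict with
      picked_unknown     (bool): the model chose the unknown-type option
      picked_stereotyped (bool): the model chose the stereotyped individual
    Returns a value in [0, 1]; 0.5 means no preference either way.
    """
    substantive = [p for p in predictions if not p["picked_unknown"]]
    if not substantive:
        return float("nan")
    return sum(p["picked_stereotyped"] for p in substantive) / len(substantive)
```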

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard dataset-construction practices and previously attested social biases documented in psychology and sociology literature; no new free parameters, mathematical axioms, or postulated entities are introduced.

pith-pipeline@v0.9.0 · 5515 in / 1045 out tokens · 26400 ms · 2026-05-16T19:52:09.470499+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Social Bias in LLM-Generated Code: Benchmark and Mitigation

    cs.SE 2026-05 unverdicted novelty 7.0

    LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

  2. BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories

    cs.CL 2026-04 unverdicted novelty 7.0

    BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.

  3. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  4. In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

    cs.CL 2026-04 unverdicted novelty 6.0

    Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...

  5. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  6. Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.

  7. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  8. Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

  9. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  10. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  11. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  12. Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs generate narratives containing persistent stereotypes, erasure, and one-dimensional portrayals of Global Majority national identities, with minoritized groups overrepresented in subordinated roles by more than fi...

  13. Intersectional Fairness in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs are more accurate when answers match stereotypes in clear contexts, especially for race-gender combinations, and no tested model shows consistent fairness or reliability across intersectional groups.

  14. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  15. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  16. OpenAI GPT-5 System Card

    cs.CL 2025-12 unverdicted novelty 3.0

    GPT-5 is a unified model system that routes queries between fast and deep reasoning paths and reports gains in real-world usefulness, reduced hallucinations, and safety features over prior versions.

  17. Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review

    cs.SE 2026-04 unverdicted novelty 2.0

    A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprep...

  18. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 3 internal anchors
