pith. machine review for the scientific record.

arxiv: 2110.08193 · v2 · submitted 2021-10-15 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

BBQ: A Hand-Built Bias Benchmark for Question Answering

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 19:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords: bias benchmark · question answering · social stereotypes · NLP fairness · protected classes · model evaluation · gender bias

The pith

Question answering models rely on social stereotypes, showing higher accuracy when correct answers align with biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Bias Benchmark for QA (BBQ), a dataset of hand-constructed question sets targeting attested social biases across nine dimensions in U.S. English-speaking contexts. It evaluates models in under-informative contexts to measure reliance on stereotypes and in adequately informative contexts to check whether biases override the correct answer. Models reproduce harmful stereotypes in ambiguous settings and average up to 3.4 percentage points higher accuracy when the correct answer matches a bias, with the gap widening to over 5 points on gender-targeted examples. This shows how learned biases surface in applied NLP tasks like QA.

Core claim

Models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.
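
To make the measured quantity concrete, the sketch below shows one way the bias-aligned versus bias-conflicting accuracy gap could be computed from per-example evaluation records. The record fields (correct, aligned, category) are hypothetical names chosen for illustration, not the paper's released data format.

```python
from collections import defaultdict

def accuracy_gaps(records):
    """Accuracy gap (bias-aligned minus bias-conflicting), in percentage points.

    Each record is a dict with:
      correct  (bool): the model picked the gold answer
      aligned  (bool): the gold answer matches the attested stereotype
      category (str) : social dimension, e.g. "Gender_identity"
    Returns a dict of per-category gaps plus an "_overall" entry.
    """
    buckets = defaultdict(lambda: {"aligned": [], "conflicting": []})
    for r in records:
        key = "aligned" if r["aligned"] else "conflicting"
        buckets[r["category"]][key].append(r["correct"])
        buckets["_overall"][key].append(r["correct"])

    def acc(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    # A positive value means the model is more accurate when the correct
    # answer matches the stereotype; ~3.4 overall would match the abstract.
    return {
        cat: 100.0 * (acc(b["aligned"]) - acc(b["conflicting"]))
        for cat, b in buckets.items()
    }
```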

What carries the argument

The BBQ dataset of question sets that highlight attested social biases against protected classes along nine social dimensions.
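
As a sketch of the unit this dataset is built from, the hypothetical structure below models one BBQ-style question set: an ambiguous context, a disambiguating continuation, a question, answer options including an unknown-type choice, and a bias-dimension label. The field names and schema are illustrative assumptions, not the released format.

```python
from dataclasses import dataclass

@dataclass
class BBQItem:
    """One hypothetical question set in the style the paper describes."""
    category: str                # one of the nine social dimensions
    ambiguous_context: str       # the under-informative setting shows only this
    disambiguating_context: str  # appended in the adequately informative setting
    question: str
    answers: list                # e.g. two individuals plus an unknown-type option
    unknown_index: int           # position of the unknown-type option
    gold_index: int              # correct answer once the context is informative
    stereotyped_index: int       # answer an attested stereotype would favor
```

Under this framing, an unbiased model should select the unknown-type option when only the ambiguous context is shown, and the gold answer once the disambiguating context is added.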

If this is right

  • Models produce biased outputs in under-informative contexts by defaulting to stereotypes.
  • Accuracy is consistently higher when the correct answer aligns with a social bias.
  • The bias effect is stronger on gender-targeted examples, exceeding 5 percentage points for most models.
  • BBQ can be used to evaluate and track bias in QA models over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of QA systems should test for bias in ambiguous scenarios common in real use.
  • The benchmark could help identify which dimensions cause the most bias for specific models.
  • Extending BBQ to other languages might reveal cultural differences in model biases.

Load-bearing premise

The hand-constructed questions accurately capture attested real-world social biases in U.S. English contexts without introducing artificial patterns that models exploit differently from natural text.

What would settle it

A test showing no difference in model accuracy between bias-aligned and bias-conflicting correct answers in the informative context setting.
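
One illustrative way to run such a test is a paired sign-flip permutation test of the null hypothesis that correctness on bias-aligned and bias-conflicting versions of matched items is exchangeable. The sketch below assumes equal-length 0/1 correctness lists for matched pairs; it is not an analysis taken from the paper.

```python
import random

def paired_permutation_test(aligned, conflicting, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test for the paired accuracy gap.

    aligned, conflicting: equal-length 0/1 lists, one entry per matched item
    (bias-aligned vs. bias-conflicting version). Returns (gap in percentage
    points, p-value) under the null that the two labels are exchangeable.
    """
    rng = random.Random(seed)
    n = len(aligned)
    observed = 100.0 * (sum(aligned) - sum(conflicting)) / n

    extreme = 0
    for _ in range(n_resamples):
        diff = 0
        for a, c in zip(aligned, conflicting):
            if rng.random() < 0.5:  # randomly swap the pair's labels
                a, c = c, a
            diff += a - c
        if abs(100.0 * diff / n) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_resamples + 1)
```

A gap that survives this test in the informative-context setting would support the paper's claim; a null result would settle the question in the other direction.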

read the original abstract

It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Bias Benchmark for QA (BBQ), a hand-constructed dataset of question sets targeting attested social biases across nine dimensions in U.S. English contexts. It evaluates QA models in under-informative contexts (to measure stereotype reliance in outputs) and adequately informative contexts (to test whether biases override correct answers), finding that models reproduce harmful stereotypes in ambiguous settings and show an average 3.4 percentage point accuracy advantage (widening to over 5 points for gender) when the correct answer aligns with a social bias.

Significance. If the empirical measurements hold, BBQ offers a valuable applied benchmark for how social biases manifest in QA outputs rather than isolated representations, directly addressing a gap in task-specific bias evaluation. The reported accuracy gaps provide falsifiable, quantitative evidence of persistent stereotype effects even with informative context, which could guide targeted mitigation work.

major comments (3)
  1. [Dataset construction] Dataset construction section: the manuscript provides no inter-annotator agreement statistics or validation procedure for the hand-authored contexts, questions, and answer choices. Without these, the reliability of the bias dimension labels and the claim that items capture attested rather than author-specific patterns cannot be assessed.
  2. [Results] Results section (discussion of accuracy gaps): no statistical testing, confidence intervals, or significance assessment is reported for the 3.4 pp overall gap or the >5 pp gender gap. Given the modest size of the effect, this omission weakens the central claim that models 'still rely on stereotypes' in informative contexts.
  3. [Evaluation] Evaluation and analysis sections: the paper does not include any comparison or external validation showing that the hand-built items match the distribution or linguistic properties of naturally occurring biased QA instances in U.S. English. This leaves open the possibility that measured effects partly reflect construction artifacts rather than real-world social biases.
minor comments (2)
  1. [Abstract] Abstract: specify the total number of models evaluated and list the exact nine social dimensions for clarity.
  2. [Figures/Tables] Figure and table captions: ensure all bias dimensions and model names are fully spelled out rather than abbreviated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript provides no inter-annotator agreement statistics or validation procedure for the hand-authored contexts, questions, and answer choices. Without these, the reliability of the bias dimension labels and the claim that items capture attested rather than author-specific patterns cannot be assessed.

    Authors: The BBQ dataset was hand-constructed by the authors based on attested biases documented in the social science literature rather than crowd-sourced annotations, so standard inter-annotator agreement statistics do not apply. We will revise the dataset construction section to provide a more detailed description of the item creation process, the specific references used to ground each bias dimension, and the internal consistency checks performed to ensure items target documented social patterns rather than author-specific ones. revision: yes

  2. Referee: [Results] Results section (discussion of accuracy gaps): no statistical testing, confidence intervals, or significance assessment is reported for the 3.4 pp overall gap or the >5 pp gender gap. Given the modest size of the effect, this omission weakens the central claim that models 'still rely on stereotypes' in informative contexts.

    Authors: We agree that statistical assessment would strengthen the results. In the revised manuscript we will add bootstrap-derived 95% confidence intervals for all reported accuracy differences (including the overall 3.4 pp gap and the gender gap exceeding 5 pp) along with p-values from paired significance tests to evaluate whether the observed differences are statistically reliable (an illustrative sketch of such an interval follows these responses). revision: yes

  3. Referee: [Evaluation] Evaluation and analysis sections: the paper does not include any comparison or external validation showing that the hand-built items match the distribution or linguistic properties of naturally occurring biased QA instances in U.S. English. This leaves open the possibility that measured effects partly reflect construction artifacts rather than real-world social biases.

    Authors: We acknowledge the value of such validation but note that BBQ is deliberately constructed as a controlled, targeted benchmark to isolate specific bias effects rather than to replicate the distribution of naturally occurring QA data. We will add an explicit limitations paragraph discussing this design choice and the trade-off between controlled construction and ecological validity, while retaining the grounding in attested biases from prior literature. revision: partial
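
As a companion to the second response above, here is a minimal sketch of a percentile-bootstrap 95% confidence interval for the accuracy gap. The resampling unit (per-example records resampled independently within each group) is an assumption made for illustration; the authors' actual analysis might resample templates or use a different scheme.

```python
import random

def bootstrap_gap_ci(aligned, conflicting, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the aligned-minus-conflicting accuracy gap.

    aligned, conflicting: 0/1 correctness lists for the two groups.
    Returns (lo, hi) in percentage points.
    """
    rng = random.Random(seed)

    def gap(a, c):
        return 100.0 * (sum(a) / len(a) - sum(c) / len(c))

    gaps = sorted(
        gap([rng.choice(aligned) for _ in aligned],
            [rng.choice(conflicting) for _ in conflicting])
        for _ in range(n_boot)
    )
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```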

Circularity Check

0 steps flagged

No circularity: empirical measurements on newly constructed benchmark dataset

full rationale

The paper introduces the BBQ dataset via hand-authored question sets targeting attested U.S. English social biases across nine dimensions. All reported results consist of direct accuracy and stereotype-rate measurements on held-out model evaluations in under-informative versus informative contexts. No equations, fitted parameters, or derived predictions appear; the central claims are statistical aggregates over the authors' own test items rather than reductions of any quantity to prior fitted values or self-citations. The evaluation chain is therefore self-contained empirical testing and does not reduce to its inputs by construction.
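
The rationale refers to stereotype-rate measurements alongside accuracy; one simple illustration of such a rate for the ambiguous setting is the share of substantive (non-unknown) answers that pick the stereotyped individual. The sketch below uses hypothetical field names and is not necessarily the exact bias score the paper defines.

```python
def stereotype_rate(predictions):
    """Share of substantive (non-unknown) answers that follow the stereotype.

    predictions: records for ambiguous-context items, each a dict with
      picked_unknown     (bool): the model chose the unknown-type option
      picked_stereotyped (bool): the model chose the stereotyped individual
    Returns a value in [0, 1]; 0.5 means no preference either way.
    """
    substantive = [p for p in predictions if not p["picked_unknown"]]
    if not substantive:
        return float("nan")
    return sum(p["picked_stereotyped"] for p in substantive) / len(substantive)
```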

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard dataset-construction practices and previously attested social biases documented in psychology and sociology literature; no new free parameters, mathematical axioms, or postulated entities are introduced.

pith-pipeline@v0.9.0 · 5515 in / 1045 out tokens · 26400 ms · 2026-05-16T19:52:09.470499+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Social Bias in LLM-Generated Code: Benchmark and Mitigation

    cs.SE 2026-05 unverdicted novelty 7.0

    LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

  2. BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories

    cs.CL 2026-04 unverdicted novelty 7.0

    BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.

  3. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

    cs.AI 2026-05 unverdicted novelty 6.0

    Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

  4. In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

    cs.CL 2026-04 unverdicted novelty 6.0

    Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...

  5. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  6. Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.

  7. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  8. Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

  9. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  10. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  11. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  12. Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs generate narratives containing persistent stereotypes, erasure, and one-dimensional portrayals of Global Majority national identities, with minoritized groups overrepresented in subordinated roles by more than fi...

  13. Intersectional Fairness in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs are more accurate when answers match stereotypes in clear contexts, especially for race-gender combinations, and no tested model shows consistent fairness or reliability across intersectional groups.

  14. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  15. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  16. OpenAI GPT-5 System Card

    cs.CL 2025-12 unverdicted novelty 3.0

    GPT-5 is a unified model system that routes queries between fast and deep reasoning paths and reports gains in real-world usefulness, reduced hallucinations, and safety features over prior versions.

  17. Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review

    cs.SE 2026-04 unverdicted novelty 2.0

    A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprep...

  18. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 3 internal anchors
