BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with bias.
Semantics derived automatically from language corpora contain human-like biases
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
other 1polarities
unclear 1representative citing papers
Under semantic underdetermination, LLMs cannot always guarantee strong correctness, strict non-bias, and high utility at once.
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
A methodological framework detects subtle group-associated linguistic biases in LLM outputs by generating controlled synthetic minimal pairs, abstracting n-grams, and ranking high-signal fragments with a PMI variant for expert review.
Reasoning models expend more tokens on association-incompatible tasks than compatible ones, indicating greater effort on counter-stereotypical information, except for Claude 3.7 Sonnet which shows the reverse pattern linked to its bias-focused reasoning.
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
citing papers explorer
-
BBQ: A Hand-Built Bias Benchmark for Question Answering
BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with bias.
-
A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
Under semantic underdetermination, LLMs cannot always guarantee strong correctness, strict non-bias, and high utility at once.
-
Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
-
Contrastive Analysis of Linguistic Representations in Large Language Model Outputs through Structured Synthetic Data Generation and Abstracted N-gram Associations
A methodological framework detects subtle group-associated linguistic biases in LLM outputs by generating controlled synthetic minimal pairs, abstracting n-grams, and ranking high-signal fragments with a PMI variant for expert review.
-
Implicit Bias-Like Patterns in Reasoning Models
Reasoning models expend more tokens on association-incompatible tasks than compatible ones, indicating greater effort on counter-stereotypical information, except for Claude 3.7 Sonnet which shows the reverse pattern linked to its bias-focused reasoning.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.