Measuring and Reducing Gendered Correlations in Pre-trained Models

Alex Beutel; Ed Chi; Ellie Pavlick; Emily Pitler; Ian Tenney; Jilin Chen; Kellie Webster; Slav Petrov; Xuezhi Wang

arxiv: 2010.06032 · v2 · pith:D4FUX5SCnew · submitted 2020-10-12 · 💻 cs.CL

keywords: correlations, models, pre-trained, different, encode, gendered, unintended, accuracy

Pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode artifacts undesired in many applications, such as professions correlating with one gender more than another. We explore such gendered correlations as a case study for how to address unintended correlations in pre-trained models. We define metrics and reveal that it is possible for models with similar accuracy to encode correlations at very different rates. We show how measured correlations can be reduced with general-purpose techniques, and highlight the trade-offs different strategies have. With these results, we make recommendations for training robust models: (1) carefully evaluate unintended correlations, (2) be mindful of seemingly innocuous configuration differences, and (3) focus on general mitigations.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
cs.CY 2026-05 unverdicted novelty 7.0

StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
cs.CY 2026-05 accept novelty 7.0

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles
cs.CL 2026-06 unverdicted novelty 6.0

H-SAL erases latent concepts from text profiles using self-descriptions as implicit debiasing signals and shows competitive performance on a new multi-domain Stack Exchange helpfulness benchmark.

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
cs.AI 2026-05 unverdicted novelty 6.0

Causality provides a unifying framework for resolving trade-offs in trustworthy AI by managing invariance conflicts under changes to the data-generating process.

Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 6.0

GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.

ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 5.0

DebiasRAG uses a three-stage RAG process to generate and rerank query-specific debiasing contexts that act as fairness constraints for LLM outputs.

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks
cs.CL 2024-11 unverdicted novelty 5.0

Constructs gender-perturbed Bangla classification benchmarks and proposes RandSymKL debiasing that reduces extrinsic gender bias in pretrained models.

TrustLLM: Trustworthiness in Large Language Models
cs.CL 2024-01 unverdicted novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...

StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
cs.AI 2026-05 unverdicted novelty 4.0

Causality resolves trade-offs in trustworthy AI by treating them as invariance conflicts under different data-generating process changes.

Bias in Large Language Models: Origin, Evaluation, and Mitigation
cs.CL 2024-11 unverdicted novelty 2.0

A literature review that categorizes bias in LLMs, surveys evaluation and mitigation techniques, and discusses ethical implications.