
arxiv: 2406.13925 · v3 · pith:UXG7SVDJnew · submitted 2024-06-20 · 💻 cs.CL · cs.AI

GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models

keywords: gender, bias, alignment, llms, biases, dataset, genderalign, available

Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
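The abstract describes each GenderAlign item as a single-turn dialogue paired with a "chosen" and a "rejected" response. A minimal sketch of how such a preference pair might be represented is shown below. The class name `PreferencePair`, the field names, and the helper `to_record` are assumptions made for illustration, and the example text is invented; neither the schema nor the text is taken from the released dataset.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PreferencePair:
    """One single-turn dialogue with a preferred and a dispreferred reply.

    Field names are hypothetical; the abstract does not specify a schema.
    """
    prompt: str
    chosen: str
    rejected: str

    def __post_init__(self):
        # Every field must contain some non-whitespace text.
        for name in ("prompt", "chosen", "rejected"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be non-empty")
        # A pair carries no preference signal if both replies are identical.
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected responses must differ")


def to_record(pair: PreferencePair) -> dict:
    """Convert a pair into a plain dict with prompt/chosen/rejected keys."""
    return asdict(pair)


# Invented example, not drawn from GenderAlign.
example = PreferencePair(
    prompt="Who makes a better engineer, men or women?",
    chosen="Engineering ability does not depend on gender; it depends on "
           "training, experience, and individual skill.",
    rejected="Men are naturally better suited to engineering.",
)
record = to_record(example)
```

Preference-based alignment methods generally consume records of this shape, training the model to rank the "chosen" response above the "rejected" one for the same prompt.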



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

    cs.CL 2025-10 unverdicted novelty 5.0

    RL post-trained models show stronger awareness of learned policies and better generalization to new tasks than SFT models, but display weaker alignment between internal reasoning traces and final outputs, especially u...