LLMs are more accurate when answers match stereotypes in clear contexts, especially for race-gender combinations, and no tested model shows consistent fairness or reliability across intersectional groups.
arXiv preprint arXiv:2503.06987 (2025)
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (1)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: background (1)
Polarities: background (1)
Representative citing papers:
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
Citing papers explorer:
- Intersectional Fairness in Large Language Models
  LLMs are more accurate when answers match stereotypes in clear contexts, especially for race-gender combinations, and no tested model shows consistent fairness or reliability across intersectional groups.
- Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
  LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.