Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.
Gender bias in coreference resolution: Evaluation and debiasing methods
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.