LLM-as-a-Judge systems exhibit significant biases in specific tasks despite strong overall performance, as measured by the new CALM quantification framework.
If the prompt allows for responses that contain clear logical fallacies but still lead to a correct result, this is considered Fallacy-Oversight Bias.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2024 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge