The MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
How many organisms in this food web are simultaneously predators and prey, consume at least one primary producer, and are also located on the right half of the diagram?
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge