FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.CV (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
ChinaHeritaQA is a new bilingual VQA benchmark dataset with 2,279 images and 14,133 QA pairs for evaluating cultural reasoning abilities of VLMs on Chinese World Heritage sites across seven cognitive dimensions.
citing papers explorer
- Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity
The VOIR DIRE benchmark shows that failures of MLLM-as-a-Judge systems decompose into a positivity-floor calibration failure and an orientation failure on culturally contested items, with persona prompting recovering only the former.
- ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China
ChinaHeritaQA is a new bilingual VQA benchmark dataset with 2,279 images and 14,133 QA pairs for evaluating cultural reasoning abilities of VLMs on Chinese World Heritage sites across seven cognitive dimensions.