JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it