UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
verdicts: UNVERDICTED (5)
representative citing papers
An audit of one million Korean synthetic personas shows marginal demographic alignment does not preserve joint distributions, with three specific mismatches identified via a new Independence-Assumption Footprint method.
Using a 1PL IRT model on real cultural questions across 13 locales, the study identifies a local-language knowledge-access advantage masked by lower proficiency in raw accuracy.
Presents a Korean harm taxonomy, culturally grounded safe-response guidelines, and DPO fine-tuning that raises cultural safe rates on six open-weight LLMs with little benchmark degradation.
K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.
citing papers explorer
- UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
  UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.
- The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs
  Using a 1PL IRT model on real cultural questions across 13 locales, the study identifies a local-language knowledge-access advantage masked by lower proficiency in raw accuracy.
- Korean Culture into LLM Alignment: Toward Cultural Coherence
  Presents a Korean harm taxonomy, culturally grounded safe-response guidelines, and DPO fine-tuning that raises cultural safe rates on six open-weight LLMs with little benchmark degradation.
- K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
  K-BrowseComp is a new Korean web-browsing agent benchmark where frontier LLMs score 30-46% and Korean LLMs score 0-10% on the verified subset.