MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
11 Pith papers cite this work.
fields: cs.CL (11)
years: 2026 (11)
representative citing papers
- UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
  UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.
- LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
  LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.
- MultiSynt/MT: Trillion-Token Multi-Parallel Pre-Training Data Translated Across 36 Languages
  MultiSynt/MT supplies 4.8 trillion translated tokens in 36 languages from 100B English tokens, letting LLMs match native-data baselines with 72% fewer tokens and beat them by 15% at equal budget.
- Soft Token Alignment for Cross-Lingual Reasoning
  SOLAR aligns soft-token probability mixtures across languages in embedding space during SFT and raises multilingual reasoning accuracy by up to 17.7 points over the base model.
- SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment
  SARA aligns internal routing distributions in MoE layers to high-resource semantic anchors via symmetric JS divergence, improving low-resource language performance by 0.8-1.2% over standard instruction tuning on Global-MMLU.
- Learning When to Translate for Multilingual Reasoning
  Luar is a reinforcement learning method enabling reasoning language models to decide when to invoke English translation for improved multilingual reasoning.
- Multilinguality of Large Language Models From a Structural Perspective
  Low-resource languages are structurally more different from English in LLMs than high- or mid-resource ones, and language-specific post-training alters structures while preserving inter-language relationships.
- DEPART: DEcomposing PARiTy across Multilingual LLMs
  A Bayesian framework decomposes mLLM variance, showing language features explain 79-92% of language identity variance and that model identity vs. benchmark-model interactions dominate differently for understanding versus reasoning tasks.
- Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization
  Macro uses DPO on composite preference pairs to raise validity of multilingual self-generated counterfactual explanations by 12.55% on average over chain-of-thought while preserving minimality.
- DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer
  DuDi is a dual-signal distillation method with cross-lingual verbalizer that improves multilingual SLM performance on SEA languages and outperforms baselines on SEA-HELM.
- Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
  A multilingual self-consistency plus self-critique method raises cultural alignment scores on English queries by 5.03% on the BLEnD benchmark using only self-generated data.