UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

Ahmer Tabassum; Hasan Iqbal; Momina Ahsan; Owais Aijaz; Preslav Nakov; Sarfraz Ahmad

arxiv: 2606.07167 · v1 · pith:ZRE5QRQZnew · submitted 2026-06-05 · 💻 cs.CL · cs.AI

Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov

Pith reviewed 2026-06-27 21:53 UTC · model grok-4.3

keywords: UrduMMLU, Urdu benchmark, multilingual evaluation, large language models, native MCQs, zero-shot evaluation, human annotation, humanities vs STEM

The pith

UrduMMLU shows that the best LLM reaches about 90% on native Urdu questions, and that many models drop sharply on humanities and region-specific content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates UrduMMLU as a collection of 26,431 multiple-choice questions drawn from Urdu educational sources rather than translations. It runs zero-shot and few-shot evaluations on 30 large language models using both English and Urdu prompts. Results place the best model at roughly 90 percent accuracy while all others stay below 85 percent, with many losing 25 to 40 points on Urdu-centered humanities subjects relative to STEM. Few-shot prompting adds only modest gains. The work establishes that current models hold uneven knowledge of Urdu in its native educational context.

Core claim

UrduMMLU supplies 26,431 Urdu MCQs spanning 26 subjects and five domains, sourced from native MCQ banks and public examination PDFs, with dual human annotation and consensus filtering applied to the exam-derived items. Zero-shot testing of 30 LLMs under English and Urdu prompts finds Gemini-3.5-Flash highest at 90.20 percent and 90.34 percent accuracy; no other model exceeds 85 percent. The strongest open-source model trails by 7.79 and 8.92 points. Many models lose 25 to 40 points on Urdu-centered humanities relative to STEM, and few-shot settings produce only small improvements.
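
The STEM versus Humanities comparison behind this claim is simple arithmetic over per-item correctness. The sketch below is illustrative only: the function names and the toy records are assumptions, not the authors' evaluation code. The last lines reproduce the 12.50-point gap that the Figure 2 caption reports for Gemini-3.5-Flash under the Urdu prompt.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: accuracy in percent} from (domain, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {d: 100.0 * correct[d] / totals[d] for d in totals}


def domain_gap(acc, high="STEM", low="Humanities"):
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(acc[high] - acc[low], 2)


# Toy records (hypothetical): 9/10 STEM items correct, 6/10 Humanities items correct.
records = (
    [("STEM", True)] * 9 + [("STEM", False)]
    + [("Humanities", True)] * 6 + [("Humanities", False)] * 4
)
toy_acc = accuracy_by_domain(records)
toy_gap = domain_gap(toy_acc)

# Accuracies quoted in the Figure 2 caption for Gemini-3.5-Flash (Urdu prompt).
reported = {"STEM": 97.81, "Humanities": 85.31}
reported_gap = domain_gap(reported)
```

Reporting the gap in percentage points, rather than as a ratio, matches how the page states the 25 to 40 point drops.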

What carries the argument

UrduMMLU, a native Urdu multitask MCQ benchmark built from local educational materials with dual human annotation and consensus filtering, used to run 60 zero-shot and additional few-shot evaluations across prompt languages.

If this is right

Native-sourced benchmarks are required to detect performance shortfalls that translation-based tests miss.
Current LLMs need additional Urdu-region-specific training data to reduce large gaps on humanities subjects.
Few-shot prompting alone does not close the observed deficits in Urdu understanding.
The strongest open-source model trails the best proprietary model by roughly eight to nine points on this benchmark.
Urdu knowledge in LLMs remains uneven across domains, with STEM outperforming culture- and region-linked content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of multilingual models may benefit from systematically collecting native exam materials for other languages to surface similar hidden gaps.
Real-world educational tools built on current LLMs could underperform for Urdu speakers on locally grounded topics.
Performance differences between prompt languages suggest that evaluation protocols should routinely test both the target language and English.
The modest few-shot gains imply that scaling native pretraining data may be more effective than prompting changes.

Load-bearing premise

The 26,431 questions gathered from native Urdu MCQ banks and public examination PDFs, with dual human annotation and consensus filtering applied to the exam-derived portion, constitute a representative and reliable test of Urdu language understanding in its educational context.
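
As a rough illustration of what consensus filtering over dual annotations can look like, the sketch below keeps an item only when both annotators selected the same answer key. The field names (`annotator_a`, `annotator_b`) and the agreement-only rule are assumptions made for illustration; the page does not spell out the paper's exact filtering rule.

```python
def consensus_filter(items):
    """Keep items whose two annotators chose the same, non-missing answer key.

    Each item is a dict with hypothetical fields "id", "annotator_a" and
    "annotator_b". Items with a missing label or a disagreement are dropped.
    """
    kept = []
    for item in items:
        a, b = item["annotator_a"], item["annotator_b"]
        if a is not None and a == b:
            kept.append({"id": item["id"], "answer": a})
    return kept


# Hypothetical annotations: q1 agrees, q2 disagrees, q3 was left unlabeled.
items = [
    {"id": "q1", "annotator_a": "A", "annotator_b": "A"},
    {"id": "q2", "annotator_a": "B", "annotator_b": "C"},
    {"id": "q3", "annotator_a": None, "annotator_b": None},
]
filtered = consensus_filter(items)
```

A strict rule of this kind trades dataset size for label reliability, which is why the share of items it discards matters for judging the premise.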

What would settle it

If models achieve comparable accuracy on an English translation of the same questions as on the original Urdu version, the claim that native sources uniquely expose gaps would be weakened.

Figures

Figures reproduced from arXiv: 2606.07167 by Ahmer Tabassum, Hasan Iqbal, Momina Ahsan, Owais Aijaz, Preslav Nakov, Sarfraz Ahmad.

**Figure 2.** STEM and Humanities accuracy on URDUMMLU under the Urdu prompt for top representative models from each model group. All models score lower on Humanities. Under the Urdu prompt, Gemini-3.5-Flash scores 97.81% on STEM and 85.31% on Humanities, a gap of 12.50 points, while DeepSeek-V4-Flash drops from 97.57% to 67.32%. GPT-5.4 and Claude-Sonnet-4.6 lose more than 22 points, and several Qwen models lose mor…

**Figure 4.** Few-shot accuracy on URDUMMLU for LLaMA-3.1-8B (Grattafiori et al., 2024), Gemma-3-4B-IT, Qwen3-8B, and Qwen3-4B-Instruct-2507 under English (solid) and Urdu (dotted) prompts. Accuracy generally improves from zero-shot to five-shot across both prompt languages, although the gains remain modest. We evaluate each model at 1-, 3-, and 5-shot under English and Urdu prompts using validated demonstrations from…

**Figure 5.** Distribution of URDUMMLU items across Pakistani examination levels, grouped by domain.

**Figure 6.** Question-length analysis for UrduMMLU. Left: histogram of question lengths. Right: domain-wise counts of short (≤ 9 words) and long (> 9 words) questions. STEM is closest to a balanced split, while Humanities and Profession skew shorter. These changes reflect a deliberate balancing step rather than artifacts of preprocessing or cleaning. We down-sample overrepresented STEM and Social Sciences items and pr…

**Figure 7.** Final URDUMMLU item counts by subdomain, grouped by domain. Urdu Literature and Urdu Language contribute the largest shares, while Social Sciences and STEM distribute across a larger number of medium-sized subdomains. B.1 Annotator Demographics and Feedback: The annotation pool consisted of 17 annotators recruited for native Urdu fluency and familiarity with the Pakistani school curriculum…

**Figure 8.** Examples of the five flag categories used in the annotation guidelines. Annotators were asked to flag the…

**Figure 9.** Examples of the most common edit categories permitted by the annotation guidelines. Edits are restricted…

**Figure 10.** Annotation dashboard workflow for answer selection.

**Figure 11.** Annotation dashboard workflow for text correction and normalization.

**Figure 12.** Annotation dashboard workflow for flagging problematic items.

**Figure 13.** Pairwise annotator agreement on final-included MCQs. Each cell reports simplified Cohen’s…

**Figure 14.** Dataset-level sanity checks for URDUMMLU. Most questions remain compact enough for standard MCQ…

**Figure 15.** JSON schema for individual URDUMMLU question items. We also preserve the same two-line response structure using the English fields Answer key: and Answer text:, which allows a single parser to process outputs under both prompt languages. This design ensures that prompt language is the only substantive difference between the two evaluation settings.

**Figure 17.** Urdu prompt for multiple-choice question…

**Figure 18.** Urdu literature accuracy for four 8B-class instruction-tuned models under English and Urdu prompts. Ministral-3-8B performs best under both settings, while Qwen3-8B shows the largest prompt-language drop.
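
The Figure 15 caption notes that model outputs keep a two-line structure with the English fields `Answer key:` and `Answer text:` so that one parser can handle both prompt languages. A minimal sketch of such a parser follows. The regular expressions, the single-letter option labels, and the function name `parse_response` are assumptions for illustration, not the authors' implementation.

```python
import re

# Assumed format: "Answer key: <single Latin letter>" and "Answer text: <free text>",
# each on its own line. The option letter may optionally be wrapped in parentheses.
KEY_RE = re.compile(r"^\s*Answer key:\s*\(?([A-Za-z])\)?\s*$", re.MULTILINE)
TEXT_RE = re.compile(r"^\s*Answer text:\s*(.+?)\s*$", re.MULTILINE)


def parse_response(output):
    """Return (key, text) from a model output; either part is None if absent."""
    key_match = KEY_RE.search(output)
    text_match = TEXT_RE.search(output)
    key = key_match.group(1).upper() if key_match else None
    text = text_match.group(1) if text_match else None
    return key, text


# The field names stay in English even when the answer text is in Urdu.
urdu_output = "Answer key: B\nAnswer text: لاہور"
parsed = parse_response(urdu_output)
unparsed = parse_response("I think the answer is B.")
```

Returning `None` for outputs that do not follow the format lets an evaluation script count them as incorrect or report them separately, rather than guessing an answer.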

Original abstract

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UrduMMLU gives a practical native benchmark for a major language, but the abstract leaves annotation reliability and error details thin.

The main point is that this paper builds the first large MMLU-style test set for Urdu straight from local exam PDFs and MCQ banks instead of translations.

It gathers 26k questions over 26 subjects and five domains, applies dual annotation with consensus filtering on the exam portion, and runs 60 zero-shot evaluations plus some few-shot tests on 30 models. The results show Gemini-3.5-Flash at the top around 90 percent while others lag, with bigger drops on humanities and region-specific items. That pattern is worth knowing for anyone working on multilingual models.

The collection approach and the coverage of Urdu-centered content are the real additions. Benchmark papers like this help set baselines where none existed, and the scale plus the native sourcing make the resource usable.

The soft spot is that the provided abstract gives no inter-annotator agreement numbers, no error analysis, and no statistical tests on the labels. Without those, it is hard to judge how noisy the test set actually is. The performance claims rest on the data quality, so the full paper needs to show those checks clearly.

This is for researchers doing multilingual evaluation or building tools for South Asian languages. It is the sort of work that deserves referee time because the gap it fills is real and the evaluations are straightforward to check once the annotation details are there.

Referee Report

3 major / 3 minor

Summary. The paper introduces UrduMMLU, a benchmark consisting of 26,431 Urdu-language multiple-choice questions spanning 26 subjects and five domains, sourced from native Urdu MCQ banks and public examination PDFs rather than translations. It describes a dual human annotation protocol with consensus filtering for the exam-derived portion, followed by zero-shot evaluation of 30 LLMs (and few-shot for four open-source models) under both English and Urdu prompts. The central empirical claims are that Gemini-3.5-Flash achieves the highest accuracies (90.20% and 90.34%), no other model exceeds 85%, the strongest open-source model trails by 7.79–8.92 points, and many models drop 25–40 points on Urdu-centered humanities subjects relative to STEM; few-shot prompting yields only modest gains.

Significance. If the benchmark construction and labels prove reliable, UrduMMLU would provide a valuable native-context resource for assessing LLM performance in a widely spoken but under-evaluated language, exposing gaps in region-specific and humanities knowledge that translation-based benchmarks may miss. The scale (26k questions) and coverage of both standard and Urdu-specific content strengthen its potential utility for future multilingual work.

major comments (3)

§3 (Data Collection and Annotation): The dual-annotation protocol with consensus filtering is described, but no inter-annotator agreement statistics, disagreement rates, or resolution process details are reported. These metrics are load-bearing for the claim that the 26,431 questions constitute a reliable test set, as the performance gaps (e.g., humanities vs. STEM drops) could be inflated by label noise.
§5 (Experiments): The zero-shot results for Gemini-3.5-Flash and the 7.79–8.92 point gap to the strongest open-source model are presented without statistical significance tests, confidence intervals, or error analysis across subjects. This weakens the central claim that Urdu knowledge remains uneven, particularly the 25–40 point humanities drops.
§2 and §4: The representativeness claim for the collected MCQs (native banks + exam PDFs) lacks discussion of coverage gaps, potential selection bias toward certain examination boards, or comparison to Urdu educational curricula; this directly affects whether the benchmark supports broad conclusions about LLM Urdu understanding.
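
The agreement statistic requested in the §3 comment can be computed directly from two annotators' answer keys. The sketch below implements the standard Cohen's kappa; it is not necessarily the "simplified" variant that the Figure 13 caption mentions, and the label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Standard Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical answer keys from two annotators on four items.
annotator_a = ["A", "A", "B", "B"]
annotator_b = ["A", "A", "B", "A"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Kappa corrects raw agreement for chance, so it is more informative than a plain agreement rate when answer keys are unevenly distributed.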

minor comments (3)

Table 1 or equivalent: Clarify the exact split between exam-derived and MCQ-bank questions, and report the number of questions per subject/domain for transparency.
§5.2: The few-shot results for the four open-source models are mentioned only briefly; adding per-subject breakdowns or prompt templates would improve reproducibility.
References: Ensure all cited Urdu MCQ sources and prior multilingual benchmarks (e.g., MMLU, other language-specific variants) are fully listed with DOIs or URLs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the presentation of our benchmark's construction and evaluation. We address each major comment below and will revise the manuscript accordingly.

Referee: §3 (Data Collection and Annotation): The dual-annotation protocol with consensus filtering is described, but no inter-annotator agreement statistics, disagreement rates, or resolution process details are reported. These metrics are load-bearing for the claim that the 26,431 questions constitute a reliable test set, as the performance gaps (e.g., humanities vs. STEM drops) could be inflated by label noise.

Authors: We agree that inter-annotator agreement metrics are essential to substantiate label reliability. In the revised manuscript we will add the percentage of initial disagreements, the final agreement rate after consensus, and Cohen's kappa computed on the dual-annotated exam-derived subset, together with a brief description of the resolution procedure (third annotator review for persistent disagreements). These additions directly address the concern that label noise could exaggerate domain gaps. revision: yes

Referee: §5 (Experiments): The zero-shot results for Gemini-3.5-Flash and the 7.79–8.92 point gap to the strongest open-source model are presented without statistical significance tests, confidence intervals, or error analysis across subjects. This weakens the central claim that Urdu knowledge remains uneven, particularly the 25–40 point humanities drops.

Authors: We accept that statistical support is needed. The revision will include bootstrap-derived 95% confidence intervals for all reported accuracies, McNemar's tests for pairwise model comparisons, and an expanded per-subject error analysis that quantifies variance and highlights the humanities–STEM disparity with subject-level breakdowns. These additions will make the uneven-knowledge claim more robust. revision: yes

Referee: §2 and §4: The representativeness claim for the collected MCQs (native banks + exam PDFs) lacks discussion of coverage gaps, potential selection bias toward certain examination boards, or comparison to Urdu educational curricula; this directly affects whether the benchmark supports broad conclusions about LLM Urdu understanding.

Authors: We will expand the relevant paragraphs in §2 and §4 to list the examination boards whose PDFs were used, note the proportion of questions drawn from each source, discuss alignment with the Pakistani national curriculum and Indian state-board syllabi where data permit, and explicitly acknowledge coverage limitations arising from the public availability of PDFs. The revised text will qualify the generalizability claim while retaining the argument that native sourcing captures regionally specific content missed by translations. revision: yes
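
The statistical checks named in the §5 response, bootstrap confidence intervals and McNemar's test, can be sketched with the standard library alone. Everything below is illustrative: the function names, the percentile bootstrap, the exact (binomial) form of McNemar's test, and the per-item correctness vectors are assumptions rather than the procedure the authors would use.

```python
import random
from math import comb


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # resample items with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact_p(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two models scored on the same items."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c  # only items where the two models disagree carry information
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical correctness vectors for two models on the same 10 items.
model_a = [1] * 10
model_b = [0] * 10
p_value = mcnemar_exact_p(model_a, model_b)
ci_low, ci_high = bootstrap_accuracy_ci([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
```

McNemar's test suits this setting because every model answers the same questions, so the comparison is paired rather than between independent samples.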

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark paper with direct measurements only

This is a standard benchmark construction and model evaluation paper. The 26,431 questions are collected from native sources, with the exam-derived portion labeled via dual human annotation and consensus filtering, and are used for direct zero-shot and few-shot accuracy measurements on 30 LLMs. No derivations, equations, fitted parameters, predictions, or self-citation chains appear in the abstract or described pipeline; all reported numbers (e.g., Gemini-3.5-Flash at 90.20%/90.34%) are raw empirical results on the collected data. The derivation chain is empty by construction, so no circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard domain assumptions in LLM benchmarking without introducing new free parameters or invented entities.

axioms (2)

domain assumption: Multiple-choice questions drawn from educational materials are a valid proxy for language understanding and subject knowledge.
Invoked throughout the benchmark construction and evaluation sections of the abstract.
domain assumption: Dual human annotation with strict consensus filtering produces reliable ground-truth labels.
Stated as the labeling method for the exam-derived portion.

pith-pipeline@v0.9.1-grok · 5772 in / 1450 out tokens · 29816 ms · 2026-06-27T21:53:28.257055+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 18 canonical work pages

[1] Shafique, Muhammad and Mehboob, Areej and Fiaz, Layba and Qadeer, Muhammad and Farooq, Hamza.
[2] Kazi, Samreen and Khoja, Shakeel. ACM Trans. Asian Low-Resour. Lang. Inf. Process. Feb. 2026. doi:10.1145/3759455
[3] Kazi, Samreen and Rahim, Maria and Khoja, Shakeel Ahmed. Crossing Language Boundaries: Evaluation of Large Language Models on Urdu-English Question Answering. Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages. 2025.
[4] Adeeba, Farah and Dillon, Brian and Sajjad, Hassan and Bhatt, Rajesh.
[5] Tahir, Munief Hassan and Shams, Sana and Fiaz, Layba and Adeeba, Farah and Hussain, Sarmad. Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks. Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025). 2025.
[6] Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.
[7] Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and Li, Tianle and Ku, Max and Wang, Kai and Zhuang, Alex and Fan, Rongqi and Yue, Xiang and Chen, Wenhu. Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024.
[8] Zhao, Qihao and Huang, Yangyu and Lv, Tengchao and Cui, Lei and Sun, Qinzheng and Mao, Shaoguang and Zhang, Xin and Xin, Ying and Yin, Qiufeng and Li, Scarlett and Wei, Furu. MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ... 2025. doi:10.18653/v1/2025.acl-long.656
[9] Singh, Shivalika and Romanou, Angelika and Fourrier, Clémentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a... Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation. 2025. doi:10.18653/v1/2025.acl-long.919
[10] Sankalp KJ and Ashutosh Kumar and Laxmaan Balaji and Nikunj Kotecha and Vinija Jain and Aman Chadha and Sreyoshi Bhaduri. arXiv:2501.15747
[11] Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and Lu, Jinghui and Jiang, Yuang and Li, Huitao and Li, Xin and Yu, Kunyu and Dong, Ruihai and Gu, Shangding and Li, Yuekang and Xie, Xiaofei and Juefei-Xu, Felix and Khomh, Foutse and Yoshie, Osamu and C... MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation. 2025. doi:10.18653/v1/2025.emnlp-main.79
[12] Angelika Romanou and Negar Foroutan and Anna Sotnikova and Sree Harsha Nelaturu and Shivalika Singh and Rishabh Maheshwary and Micol Altomare and Zeming Chen and Mohamed A. Haggag and Snegha A and Alfonso Amayuelas and Azril Hafizi Amirudin and Danylo Boiko and Michael Chang and Jenny Chim and Gal Cohen and Aditya Kumar Dalmia and Abraham Diress and Shara...
[13] Hardalov, Momchil and Mihaylov, Todor and Zlatkova, Dimitrina and Dinkov, Yoan and Koychev, Ivan and Nakov, Preslav. EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.438
[14] Verma, Sshubam and Khan, Mohammed Safi Ur Rahman and Kumar, Vishwajeet and Murthy, Rudra and Sen, Jaydeep. MILU: A Multi-task Indic Language Understanding Benchmark. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.507
[15] Koto, Fajri and Li, Haonan and Shatnawi, Sara and Doughman, Jad and Sadallah, Abdelrahman and Alraeesi, Aisha and Almubarak, Khalid and Alyafeai, Zaid and Sengupta, Neha and Shehata, Shady and Habash, Nizar and Nakov, Preslav and Baldwin, Timothy. ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic. Findings of the Association for... 2024. doi:10.18653/v1/2024.findings-acl.334
[16] Li, Haonan and Zhang, Yixuan and Koto, Fajri and Yang, Yifei and Zhao, Hai and Gong, Yeyun and Duan, Nan and Baldwin, Timothy. CMMLU: Measuring Massive Multitask Language Understanding in Chinese. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.671
[17] Koto, Fajri and Aisyah, Nurul and Li, Haonan and Baldwin, Timothy. Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.760
[18] Son, Guijin and Lee, Hanwool and Kim, Sungdong and Kim, Seungone and Muennighoff, Niklas and Choi, Taekyoon and Park, Cheonbok and Yoo, Kang Min and Biderman, Stella. KMMLU: Measuring Massive Multitask Language Understanding in Korean. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguist... 2025. doi:10.18653/v1/2025.naacl-long.206
[19] Togmanov, Mukhammed and Mukhituly, Nurdaulet and Turmakhan, Diana and Mansurov, Jonibek and Goloburda, Maiya and Sakip, Akhmed and Xie, Zhuohan and Wang, Yuxia and Syzdykov, Bekassyl and Laiyk, Nurkhan and Aji, Alham Fikri and Kochmar, Ekaterina and Nakov, Preslav and Koto, Fajri. KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan. 2025. doi:10.18653/v1/2025.acl-long.701
[20] Shafique, Muhammad Ali and Mehreen, Kanwal and Arham, Muhammad and Amjad, Maaz and Butt, Sabur and Farooq, Hamza. Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation. Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025). 2025. doi:10.18653/v1/2025.mrl-main.19
[21] Ministral 3. 2026.
[22] Muhammad Taimoor Hassan and Jawad Ahmed and Muhammad Awais. Qalb: Largest State-of-the-Art. arXiv:2601.08141
[23] An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Feng Hu and Hao Ge and Haoran Wei and Huan Lin and Jialong Tang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and... arXiv:2505.09388
[24] Muennighoff, Niklas and Wang, Thomas and Sutawika, Lintang and Roberts, Adam and Biderman, Stella and Le Scao, Teven and Bari, M Saiful and Shen, Sheng and Yong, Zheng Xin and Schoelkopf, Hailey and Tang, Xiangru and Radev, Dragomir and Aji, Alham Fikri and Almubarak, Khalid and Albanie, Samuel and Alyafeai, Zaid and Webson, Albert and Raff, Edward and Ra... Crosslingual Generalization through Multitask Finetuning. 2023. doi:10.18653/v1/2023.acl-long.891
[25] Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786
[26] Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295
[27] Anthropic. 2026.
[28] Gemma 4 Model Card. 2026.
[29] Aaron Grattafiori and Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Alex Vaughan and Amy Yang and Angela Fan and Anirudh Goyal and Anthony Hartshorn and Aobo Yang and Archi Mitra and Archie Sravankumar and Artem Korenev and Arthur Hinsvark and Arun Rao ... arXiv.
[30] 2024.
[31] 2025.
[32] Marah Abdin and Jyoti Aneja and Hany Awadalla and Ahmed Awadallah and Ammar Ahmad Awan and Nguyen Bach and Amit Bahree and Arash Bakhtiari and Jianmin Bao and Harkirat Behl and Alon Benhaim and Misha Bilenko and Johan Bjorck and Sébastien Bubeck and Martin Cai and Qin Cai and Vishrav Chaudhary and Dong Chen and Dongdong Chen and Weizhu Chen and Yen-Chun C... arXiv.
[33] Microsoft and Abdelrahman Abouelenin and Atabak Ashfaq and Adam Atkinson and Hany Awadalla and Nguyen Bach and Jianmin Bao and Alon Benhaim and Martin Cai and Vishrav Chaudhary and Congcong Chen and Dong Chen and Dongdong Chen and Junkun Chen and Weizhu Chen and Yen-Chun Chen and Yi-ling Chen and Qi Dai and Xiyang Dai and Ruchao Fan and Mei Gao and ... arXiv:2503.01743
[34] 2026.
[35] Introducing. 2026.
[36]

Gemma Team and Morgane Riviere and Shreya Pathak and Pier Giuseppe Sessa and Cassidy Hardin and Surya Bhupatiraju and Léonard Hussenot and Thomas Mesnard and Bobak Shahriari and Alexandre Ramé and Johan Ferret and Peter Liu and Pouya Tafti and Abe Friesen and Michelle Casbon and Sabela Ramos and Ravin Kumar and Charline Le Lan and Sammy Jerome and Anton T...

Pith/arXiv arXiv
[37]

The Eleventh International Conference on Learning Representations , year=

Language models are multilingual chain-of-thought reasoners , author=. The Eleventh International Conference on Learning Representations , year=
[38]

C ommonsense QA : A question answering challenge targeting commonsense knowledge

Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. C ommonsense QA : A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/...

work page doi:10.18653/v1/n19-1421 2019
[39]

The Twelfth International Conference on Learning Representations , year=

Let's Verify Step by Step , author=. The Twelfth International Conference on Learning Representations , year=
[40]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1260

work page doi:10.18653/v1/d18-1260 2018
[41]

2021 , eprint =

Kazi, Samreen and Khoja, Shakeel , journal =. 2021 , eprint =

2021
[42]

Know What You Don ' t Know: Unanswerable Questions for SQ u AD

Rajpurkar, Pranav and Jia, Robin and Liang, Percy. Know What You Don ' t Know: Unanswerable Questions for SQ u AD. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018. doi:10.18653/v1/P18-2124

work page doi:10.18653/v1/p18-2124 2018
[43]

doi:10.5281/zenodo.12608602 , url =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page doi:10.5281/zenodo.12608602
[44]

Aaditya Singh and Adam Fry and Adam Perelman and Adam Tart and Adi Ganesh and Ahmed El-Kishky and Aidan McLaughlin and Aiden Low and AJ Ostrow and Akhila Ananthram and Akshay Nathan and Alan Luo and Alec Helyar and Aleksander Madry and Aleksandr Efremov and Aleksandra Spyra and Alex Baker-Whitcomb and Alex Beutel and Alex Karpenko and Alex Makelov and Ale...

[45]

UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

Ahmad, Sarfraz and Iqbal, Hasan and Ahsan, Momina and Naeem, Numaan and Khan, Muhammad Ahsan Riaz and Riaz, Arham and Manzoor, Muhammad Arslan and Wang, Yuxia and Nakov, Preslav. UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking. Findings of the Association for Computational Linguistics: EMNLP 2025. 2...

work page doi:10.18653/v1/2025.findings-emnlp.1240 2025
