Recoverable Identifier
advisory
doi_compliance
recoverable_identifier
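The DOI in the printed bibliography is fragmented by whitespace or a line break: it is printed as "10.18653/v1/ 2024.findings-naacl.149", with a gap after "v1/". A longer candidate (10.18653/v1/2024.findings-naacl.149.URL) was reconstructed from the surrounding text, but it could not be confirmed against doi.org as printed. Its trailing ".URL" appears to come from the "URL" label that follows the DOI in the printed entry.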
Evidence text
URL https://arxiv.org/abs/2506.05176. Y. Zhang, M. Konomi, C. Xypolopoulos, K. Divriotis, K. Skianis, G. Nikolentzos, G. Stamou, G. Shang, and M. Vazirgiannis. Greekmmlu: A native-sourced multitask benchmark for evaluating language models in greek, 2026. URL https://arxiv.org/abs/2602.05150. W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In K. Duh, H. Gomez, and S. Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/ 2024.findings-naacl.149. URL https://aclanthology.org/2024.findings-naacl.149/. F. Zhou, Z. Wang, N. Ranjan, Z. Cheng, L. Tang, G. He, Z. Liu, and E. P. Xing. Megamath: Pushing the limits of open math corpora.arXiv preprint arXiv:2504.02807, 2025. Preprint. Z. Zhu, C. Xie, X. Lv, and slime Contributors. slime: An llm post-training framework for rl scaling. https://github.com/THUDM/slime, 2025. GitHub repository. Corresponding author: Xin Lv. B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus. St-moe: Designing stable and transferable sparse expert models, 2022. URL https://arxiv.org/abs/2202.08906. 36 Marco-MoE : Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling A. Per-Benchmark Results across Pre-training Phases Benchmark (Metric) Stage-1 Stage-2 Stage-3 Stage-
Evidence payload
{
"printed_excerpt": "URL https://arxiv.org/abs/2506.05176. Y. Zhang, M. Konomi, C. Xypolopoulos, K. Divriotis, K. Skianis, G. Nikolentzos, G. Stamou, G. Shang, and M. Vazirgiannis. Greekmmlu: A native-sourced multitask benchmark for evaluating language models i",
"reconstructed_doi": "10.18653/v1/2024.findings-naacl.149.URL",
"ref_index": 10,
"resolved_title": null,
"verdict_class": "incontrovertible"
}
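The fragmentation described in this report can be illustrated with a minimal sketch of how a whitespace-split DOI might be rejoined and a fused "URL" label dropped. This is not the report's own tooling: the function name recover_doi, the DOI_START pattern, and the rule "join the next token only when the fragment ends in a slash" are assumptions made for illustration. The result is still only a candidate and would need to be confirmed against doi.org.

```python
import re
from typing import Optional

# Assumption: a DOI starts with "10.", a 4-9 digit registrant code, then "/".
DOI_START = re.compile(r"10\.\d{4,9}/\S*")


def recover_doi(printed: str) -> Optional[str]:
    """Return a cleaned DOI candidate from printed text, or None if none is found.

    Hypothetical helper: heuristic only, not a validator.
    """
    match = DOI_START.search(printed)
    if match is None:
        return None

    candidate = match.group(0)
    rest = printed[match.end():]

    # If the printed DOI breaks right after a "/", glue on the next token.
    if candidate.endswith("/"):
        next_token = re.match(r"\s+(\S+)", rest)
        if next_token:
            candidate += next_token.group(1)

    # Drop a fused "URL" label, then any sentence-ending punctuation.
    candidate = re.sub(r"\.?URL$", "", candidate)
    return candidate.rstrip(".,;")


printed = "doi: 10.18653/v1/ 2024.findings-naacl.149. URL https://aclanthology.org/2024.findings-naacl.149/."
print(recover_doi(printed))                                    # 10.18653/v1/2024.findings-naacl.149
print(recover_doi("10.18653/v1/2024.findings-naacl.149.URL"))  # 10.18653/v1/2024.findings-naacl.149
```

The suffix rule is deliberately narrow: it would wrongly trim a genuine DOI that happens to end in "URL", so a stricter pipeline might only strip the label when the printed text shows "URL" as a separate token after the DOI.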