OpenCompass: A Universal Evaluation Platform for Large Language Models

arXiv: 2605.19276 · v1 · pith:CKIX5LFS · submitted 2026-05-19 · 💻 cs.CL · cs.LG

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

Pith reviewed 2026-05-20 06:13 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords: LLM evaluation, evaluation platform, large language models, benchmarking, modular architecture, high concurrency, model assessment

The pith

OpenCompass is a one-stop modular platform for scalable high-concurrency evaluation of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes OpenCompass to overcome limitations in current LLM evaluation, such as inconsistent criteria, fragmented workflows, and difficulty scaling across many tasks and models. It presents the platform as a unified tool that supports benchmarks in knowledge, reasoning, code, and other domains while allowing different evaluation styles. The design relies on separating concerns into distinct modules so the system can adapt to new models and scenarios without major rewrites. If the approach works as described, researchers and developers could run large numbers of tests more efficiently and compare models on a common footing.

Core claim

OpenCompass is a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios.

What carries the argument

The five-component modular architecture (Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, Result Visualization Module) that separates configuration, partitioning, scheduling, execution, and visualization to support compatibility and concurrency.
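
The division of labor among the five components can be pictured with a toy pipeline. This is only a sketch under stated assumptions: every function and field name below is hypothetical and chosen for illustration, none of it is the OpenCompass API, and the thread pool merely stands in for whatever scheduling backend the platform actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# All names here are hypothetical illustrations, not OpenCompass APIs.


@dataclass(frozen=True)
class Task:
    model: str
    dataset: str
    shard: int
    items: tuple


def load_config() -> dict:
    """Configuration System: declare which models run on which datasets."""
    return {
        "models": ["model-a", "model-b"],
        "datasets": {"toy-qa": ["q1", "q2", "q3", "q4", "q5"]},
        "shard_size": 2,
        "max_workers": 4,
    }


def partition(cfg: dict) -> list:
    """Task Partitioning Module: split each (model, dataset) pair into shards."""
    tasks = []
    size = cfg["shard_size"]
    for model in cfg["models"]:
        for name, items in cfg["datasets"].items():
            for shard, start in enumerate(range(0, len(items), size)):
                tasks.append(Task(model, name, shard, tuple(items[start:start + size])))
    return tasks


def run_task(task: Task) -> dict:
    """Task Execution Unit: inference plus scoring for one shard (stubbed)."""
    correct = len(task.items)  # stub: pretend every answer is right
    return {"model": task.model, "dataset": task.dataset,
            "n": len(task.items), "correct": correct}


def schedule(tasks: list, worker, max_workers: int) -> list:
    """Execution and Scheduling Module: dispatch shards to a worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))


def summarize(results: list) -> dict:
    """Result Visualization Module: aggregate shards into per-model accuracy."""
    totals = {}
    for r in results:
        key = (r["model"], r["dataset"])
        n, c = totals.get(key, (0, 0))
        totals[key] = (n + r["n"], c + r["correct"])
    return {key: c / n for key, (n, c) in totals.items()}


cfg = load_config()
tasks = partition(cfg)
summary = summarize(schedule(tasks, run_task, cfg["max_workers"]))
```

Because each stage only consumes the previous stage's output, swapping the worker pool for a cluster scheduler or adding a dataset to the config would leave the other stages untouched, which is the kind of decoupling the paper claims.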

If this is right

Supports mainstream benchmark datasets across domains including knowledge, reasoning, computation, science, language, and code.
Provides rule-based, LLM-as-a-Judge, and cascaded evaluators to match different task needs.
Offers a single interface for both academic and industrial users to assess model capabilities.
Enables more efficient identification of LLM strengths and weaknesses to guide further development.
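
The material summarized here does not define a cascaded evaluator, so the following is a sketch of one plausible reading: a cheap rule-based check runs first, and an LLM judge is consulted only when the rule cannot decide. The function names, the exact-match rule, and the stub judge are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical sketch; the names and the cascade policy are assumptions.


def rule_based(prediction: str, reference: str) -> Optional[bool]:
    """Return True on an exact normalized match, None when the rule cannot decide."""
    if prediction.strip().lower() == reference.strip().lower():
        return True
    return None  # inconclusive: the answer may still be right in another form


def make_cascade(judge: Callable[[str, str], bool]) -> Callable[[str, str], bool]:
    """Try the cheap rule first; call the judge only on inconclusive cases."""
    def evaluate(prediction: str, reference: str) -> bool:
        verdict = rule_based(prediction, reference)
        return verdict if verdict is not None else judge(prediction, reference)
    return evaluate


judge_calls = []


def stub_judge(prediction: str, reference: str) -> bool:
    """Stand-in for an LLM judge; a real one would prompt a model."""
    judge_calls.append(prediction)
    return reference.lower() in prediction.lower()


evaluate = make_cascade(stub_judge)
scores = [
    evaluate("Paris", "paris"),                  # settled by the rule
    evaluate("The capital is Paris.", "Paris"),  # escalated to the judge
    evaluate("Lyon", "Paris"),                   # escalated, judged wrong
]
```

Under this reading, the judge is invoked for two of the three predictions, so the expensive path is reserved for cases the rule leaves open.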

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use could reduce the current variation in how different groups measure the same model capabilities.
The architecture might allow quick addition of new benchmarks or models without rebuilding the entire system.
Consistent tooling could make cross-paper comparisons of LLM performance more reliable over time.
The platform could serve as a base for automated loops that test, analyze, and suggest improvements to models.

Load-bearing premise

The modular component-decoupled architecture will deliver high compatibility, flexibility, and high concurrency in practice across diverse LLM evaluation scenarios.

What would settle it

Running the platform on a workload with dozens of large models evaluated concurrently and observing whether it maintains stable throughput and correct results without custom code changes for each model.
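
A scaled-down version of such a test can be sketched in a few lines. The harness below is an assumption-laden toy, not the platform's scheduler: `fake_inference` is a hypothetical stand-in that sleeps to mimic waiting on a model server, and a thread pool is used because it suits I/O-bound API calls. Evaluating locally hosted, GPU-bound models would need process-level or device-level scheduling that this sketch does not attempt.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(task_id: int, latency_s: float = 0.01) -> int:
    """Stand-in for one evaluation task; sleeping mimics a remote model call."""
    time.sleep(latency_s)
    return task_id * task_id


def timed(run):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = run()
    return out, time.perf_counter() - start


task_ids = list(range(32))


def run_sequential():
    return [fake_inference(i) for i in task_ids]


def run_concurrent():
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fake_inference, task_ids))


sequential_results, t_seq = timed(run_sequential)
concurrent_results, t_con = timed(run_concurrent)

# Correctness check first: concurrency must not change the answers.
same_results = sequential_results == concurrent_results
print(f"sequential: {len(task_ids) / t_seq:.0f} tasks/s")
print(f"concurrent: {len(task_ids) / t_con:.0f} tasks/s")
print(f"identical results: {same_results}")
```

The two quantities to report are the ones the test above names: throughput relative to a sequential baseline, and whether results stay identical as concurrency rises.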

Original abstract

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenCompass describes a modular evaluation platform but its high-concurrency claims need more evidence than the architecture sketch provides.

The main things to know about this paper are that it introduces OpenCompass as an open-source LLM evaluation platform with a modular architecture, and that its claims for high concurrency lack concrete backing.

It does a reasonable job describing the five modules and how they enable flexibility with different evaluators and datasets. Bringing rule-based and LLM-as-judge methods together under one roof is practical for users who need to switch between objective metrics and subjective judgments. The support for a broad set of domains like knowledge, computation, and language tasks shows good coverage.

The architecture description alone does not establish the high concurrency at scale. The modules are named and their roles outlined, yet there is no explanation of specific mechanisms for task distribution across hardware or any reported performance data. This matches the stress-test concern and leaves the scalability advantage unverified.

This paper is for practitioners in AI labs who evaluate many models and want a single platform instead of custom scripts. A reader dealing with inconsistent workflows would find the unified tool useful. I recommend sending it to peer review. The practical design and open-source nature make it worth referee time for a tools-oriented venue.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OpenCompass, an open-source, one-stop platform for evaluating large language models. It describes a modular, component-decoupled architecture consisting of five elements—the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module—and a workflow supporting rule-based, LLM-as-Judge, and cascaded evaluators. The platform claims to handle mainstream benchmarks across knowledge, reasoning, computation, science, language, and code domains while delivering three core advantages: high compatibility, flexibility, and high concurrency, thereby addressing fragmentation and inefficiency in existing static benchmark evaluation methods.

Significance. If the architecture demonstrably achieves the claimed scalability and concurrency, OpenCompass could become a useful standardized tool that reduces duplication of effort in LLM evaluation for both academic and industrial users. The open-source release and support for diverse evaluator types and datasets constitute concrete strengths that could aid reproducibility and extension. The significance remains provisional, however, because the manuscript supplies no empirical measurements of the performance advantages.

major comments (2)

[Core architecture and workflow sections] The central claim of high concurrency and scalability (stated in the abstract and the core architecture description) rests on the five-component design, yet the manuscript provides no concrete mechanisms—such as how the Execution and Scheduling Module partitions tasks across GPUs, manages model loading, or handles memory contention for concurrent inference—nor any quantitative results (throughput, scaling curves, or comparisons against sequential baselines). This absence directly undermines verification of the primary advantage.
[Evaluation / Experiments] No experimental section, table, or figure reports benchmarks that would substantiate the flexibility and high-concurrency assertions (e.g., wall-clock time or resource utilization when evaluating multiple models on heterogeneous tasks). Without such data the advantages remain design assertions rather than demonstrated properties.

minor comments (2)

[Abstract] The abstract would benefit from an explicit statement of open-source repository location and a short list of supported benchmark suites to orient readers immediately.
[Workflow description] The term 'cascaded evaluators' is introduced without a brief definition or illustrative workflow diagram; adding one would improve clarity for readers unfamiliar with the concept.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We appreciate the acknowledgment of OpenCompass's modular design, open-source release, and support for multiple evaluator types as strengths that can aid reproducibility. Below we respond point-by-point to the major comments. We agree that empirical evidence is needed to substantiate the concurrency and scalability claims and will incorporate the requested material in the revised manuscript.

Referee: [Core architecture and workflow sections] The central claim of high concurrency and scalability (stated in the abstract and the core architecture description) rests on the five-component design, yet the manuscript provides no concrete mechanisms—such as how the Execution and Scheduling Module partitions tasks across GPUs, manages model loading, or handles memory contention for concurrent inference—nor any quantitative results (throughput, scaling curves, or comparisons against sequential baselines). This absence directly undermines verification of the primary advantage.

Authors: We acknowledge that the current manuscript presents the high-level five-component architecture but does not detail the internal mechanisms of the Execution and Scheduling Module for GPU partitioning, model loading, or memory contention handling, nor does it include quantitative performance data. In the revised version we will expand the relevant sections to describe these mechanisms explicitly (including task distribution logic and concurrency controls) and add throughput measurements, scaling curves, and comparisons against sequential baselines to substantiate the high-concurrency claims.

Revision: yes

Referee: [Evaluation / Experiments] No experimental section, table, or figure reports benchmarks that would substantiate the flexibility and high-concurrency assertions (e.g., wall-clock time or resource utilization when evaluating multiple models on heterogeneous tasks). Without such data the advantages remain design assertions rather than demonstrated properties.

Authors: We agree that the absence of an experimental section leaves the claimed advantages as design assertions rather than demonstrated results. The original submission focused on platform architecture and workflow; we will add a dedicated Experiments section in the revision that reports wall-clock time, resource utilization, and comparative benchmarks when evaluating multiple models across heterogeneous tasks, thereby providing the quantitative evidence requested.

Revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive software platform paper with no derivations or predictions

The paper is a direct description of the OpenCompass LLM evaluation platform, its five-component modular architecture (Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, Result Visualization Module), and supported evaluators. No mathematical derivations, equations, fitted parameters, predictions, or first-principles results are present. Claims of high compatibility, flexibility, and high concurrency are presented as design outcomes of the modular decoupling rather than derived quantities that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The derivation chain is absent, making the paper self-contained as an engineering artifact description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The platform rests on standard software engineering assumptions about modularity and the availability of existing benchmark datasets; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Modular and decoupled component design improves compatibility, flexibility, and concurrency in evaluation systems.
Invoked in the description of the core architecture and design philosophy.

Domain assumption: Mainstream benchmark datasets across knowledge, reasoning, computation, science, language, and code domains are sufficient to evaluate LLM capabilities comprehensively.
Stated in the support for multiple domains without additional justification.
