CLEVA: Chinese Language Models EVAluation Platform

Dahua Lin; Duo Zheng; Jianqiao Zhao; Liwei Wang; Michael R. Lyu; Shijia Huang; Xiaohui Su; Yanyang Li; Yongfeng Huang; Zhi Chen

arxiv: 2308.04813 · v2 · pith:MYWQKVUNnew · submitted 2023-08-09 · 💻 cs.CL

Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu, Liwei Wang

keywords: chinese, llms, cleva, evaluation, model, platform, contamination, evaluate

Abstract

With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 Chinese LLMs have validated CLEVA's efficacy.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCAN: Structured Capability Assessment and Navigation for LLMs
cs.CL 2025-05 unverdicted novelty 5.0

SCAN is a framework for fine-grained LLM capability assessment via automatic taxonomy construction from queries, query synthesis for coverage, visualization tools, and a PC2-enhanced LLM-as-a-judge method, applied to ...

TrustLLM: Trustworthiness in Large Language Models
cs.CL 2024-01 unverdicted novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...