pith. sign in

arxiv: 2304.12986 · v3 · pith:7EJGAANZnew · submitted 2023-04-25 · 💻 cs.CL · cs.AI

Measuring Massive Multitask Chinese Understanding

classification 💻 cs.CL cs.AI
keywords modelsaccuracyzero-shotacrosschinesehighestmedicinesubtasks
0
0 comments X
read the original abstract

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  2. Retrieval-Augmented Generation for Large Language Models: A Survey

    cs.CL 2023-12 unverdicted novelty 3.0

    A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.