Advances in Neural Information Processing Systems 36 (2023), 52430–52452

Benchmarking Large Language Models on CMExam - A Comprehensive Chinese Medical Exam Dataset · 2023

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

cs.CL · 2025-04-27 · conditional · novelty 7.0

BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.

Showing 1 of 1 citing paper.

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese · cs.CL · 2025-04-27 · conditional · none · ref 10