{"paper":{"title":"BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A new benchmark shows most LLMs score below 20% when browsing the Chinese web for verifiable facts.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Bruce Leon, Can Zhang, Chao Liu, Chenxuan Xie, Dading Chong, Jian Chen, Jing Ren, Meng Cao, Peilin Zhou, Qichen Ye, Sixin Hong, Xiang Ying, Yifan Shao, Yining Hua, Yuxin Gu, Zhiling Jin","submitted_at":"2025-04-27T17:32:43Z","abstract_excerpt":"As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The two-stage quality control protocol produces questions that are genuinely high-difficulty and have unique verifiable answers without hidden shortcuts or English leakage.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A new benchmark shows most LLMs score below 20% when browsing the Chinese web for verifiable facts.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8b91367cedd1151e988a93a3ab926b1a8ea0f079d0822601428f304833147c91"},"source":{"id":"2504.19314","kind":"arxiv","version":2},"verdict":{"id":"7eb41fbb-f7fb-4605-9b6a-57a1e01a539a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T22:00:29.433365Z","strongest_claim":"Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%.","one_line_summary":"BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The two-stage quality control protocol produces questions that are genuinely high-difficulty and have unique verifiable answers without hidden shortcuts or English leakage.","pith_extraction_headline":"A new benchmark shows most LLMs score below 20% when browsing the Chinese web for verifiable facts."},"references":{"count":22,"sample":[{"doi":"","year":2024,"title":"From Local to Global: A Graph RAG Approach to Query-Focused Summarization","work_id":"588618d7-fd41-4053-b34d-a981f8793039","ref_index":1,"cited_arxiv_id":"2404.16130","is_internal_anchor":true},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2407.12468 (2024)","work_id":"7ada9217-2b2f-4c16-b3de-cf30543e9d91","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":3,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2411.19478 (2024)","work_id":"09001285-da58-4291-a9e0-7a70f63c178c","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"arXiv preprint arXiv:2502.15690 (2024)","work_id":"686e0d3a-0189-43ad-b81b-5834bff6f19a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":22,"snapshot_sha256":"c9683a91134875816e47c533896d325a5845394104ea7076ddf0ab38d7e5db36","internal_anchors":9},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a02427b97d89a009fa5133f2531977962cdfeeff6c8c89ca57ffdeed24570937"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}