pith. machine review for the scientific record.

arxiv: 2504.19314 · v2 · pith:PQRUAG7Cnew · submitted 2025-04-27 · 💻 cs.CL

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Pith reviewed 2026-05-17 22:00 UTC · model grok-4.3

keywords: LLM agents · web browsing · Chinese web · multi-hop questions · benchmark · information retrieval · agent evaluation

The pith

A new benchmark shows most LLMs score below 20% when browsing the Chinese web for verifiable facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BrowseComp-ZH, a benchmark of 289 multi-hop questions across 11 domains that tests LLM agents on real-time Chinese web browsing. Questions are reverse-engineered from short verifiable answers such as dates or proper nouns, then filtered through a two-stage quality control process intended to promote high difficulty and unique answers. Evaluation of over twenty state-of-the-art models and agentic systems finds that many score below 10% accuracy, only a few exceed 20%, and the strongest system reaches just 42.9%. The results indicate that success requires effective retrieval combined with sophisticated reasoning to reconcile information amid the linguistic and infrastructural complexities of the Chinese web. The dataset and guidelines are released publicly to support further work on non-English agent capabilities.

Core claim

BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains on the Chinese web. Each question is reverse-engineered from a short, objective, and easily verifiable answer. A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. Benchmarking shows that many models score below 10% accuracy, with only a handful exceeding 20% and the best-performing system reaching 42.9%. These outcomes demonstrate that current models still struggle to master the retrieval strategies, reasoning, and information reconciliation needed for reliable Chinese web browsing.

What carries the argument

BrowseComp-ZH benchmark of 289 reverse-engineered multi-hop questions with two-stage quality control for testing LLM web-browsing agents on the Chinese web.

Load-bearing premise

The two-stage quality control protocol produces questions that are genuinely high-difficulty and have unique verifiable answers without hidden shortcuts or English leakage.

What would settle it

A model achieving over 70% accuracy on the released questions, or an independent review finding that many questions admit multiple valid answers or contain English-based shortcuts, would challenge the claim of substantial model limitations.
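
To put the 70% threshold in context, it helps to see how much sampling uncertainty a 289-question benchmark carries. The sketch below computes a Wilson score interval for a reported accuracy. It is illustrative only: the count of 124 correct answers is an assumption chosen because 124/289 rounds to 42.9%, and this page reports only the percentage, not the raw count.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Assumed count: 124 of 289 rounds to 42.9%; the raw count is not given on this page.
low, high = wilson_interval(124, 289)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under that assumption the interval spans several percentage points on either side of the point estimate, while a 70% result would sit well above its upper end.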

Original abstract

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces BrowseComp-ZH, a benchmark of 289 multi-hop questions spanning 11 domains for evaluating LLM agents on Chinese web browsing. Questions are reverse-engineered from short verifiable answers (e.g., dates or proper nouns), with a two-stage quality control protocol applied to promote high difficulty and answer uniqueness. Over 20 models and agentic systems are benchmarked, with many scoring below 10% accuracy and the top system (OpenAI's DeepResearch) reaching 42.9%. The dataset, guidelines, and results are publicly released.

Significance. If the benchmark questions prove robust, this work meaningfully extends English-centric evaluations like BrowseComp by highlighting limitations in retrieval, reasoning, and information reconciliation on the Chinese web, including linguistic and infrastructural factors. The public release of the 289 questions and construction guidelines is a clear strength that enables reproducibility and community extensions. The reported performance ceiling provides a concrete, falsifiable baseline for progress in multilingual agent capabilities.

major comments (1)
  1. [§3.2–3.3] The two-stage quality control protocol is described in qualitative terms, but no quantitative metrics are reported, such as inter-annotator agreement scores, error rates on answer uniqueness, or statistics on how questions were filtered for difficulty. This is load-bearing for the central claim, as the low model accuracies (max 42.9%) and assertions of high difficulty, multi-hop structure, and absence of shortcuts or English leakage rest directly on the effectiveness of this protocol.
minor comments (2)
  1. [Abstract] The abstract states that 'a large number achieve accuracy rates below 10%' and 'only a handful exceed 20%'; adding a reference to the specific table or figure with exact counts would improve precision and allow readers to verify the distribution without searching the results section.
  2. [Evaluation section] Consider clarifying in the evaluation protocol whether answer verification during benchmarking allows partial credit or requires exact matches, as this directly affects interpretation of the 42.9% result for DeepResearch.
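
The second minor comment can be made concrete with a small scoring sketch. It contrasts a strict exact-match rule with a lenient containment rule on short answers. Everything here is hypothetical: the function names, the normalization steps, and the three sample pairs are illustrative, and this page does not state how the paper actually grades answers.

```python
import unicodedata


def normalize(answer: str) -> str:
    """Fold full-width characters via NFKC, lowercase, and drop whitespace."""
    folded = unicodedata.normalize("NFKC", answer).lower()
    return "".join(ch for ch in folded if not ch.isspace())


def exact_match(prediction: str, gold: str) -> bool:
    """Strict rule: normalized strings must be identical."""
    return normalize(prediction) == normalize(gold)


def lenient_match(prediction: str, gold: str) -> bool:
    """Lenient rule: the normalized gold answer appears inside the prediction."""
    return normalize(gold) in normalize(prediction)


def accuracy(pairs, matcher) -> float:
    """Fraction of (prediction, gold) pairs accepted by the given matcher."""
    pairs = list(pairs)
    return sum(matcher(pred, gold) for pred, gold in pairs) / len(pairs)


# Hypothetical (prediction, gold) pairs, not taken from the benchmark.
samples = [
    ("１９９８年", "1998年"),     # full-width digits, same answer
    ("答案是1998年", "1998年"),  # correct answer wrapped in extra text
    ("2001年", "1998年"),        # wrong answer
]
strict = accuracy(samples, exact_match)
lenient = accuracy(samples, lenient_match)
```

On these three pairs the strict rule accepts one answer and the lenient rule accepts two, which is why the choice of rule matters when reading a headline figure such as 42.9%.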

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and indicate the planned revisions.

  1. Referee: [§3.2–3.3] The two-stage quality control protocol is described in qualitative terms, but no quantitative metrics are reported, such as inter-annotator agreement scores, error rates on answer uniqueness, or statistics on how questions were filtered for difficulty. This is load-bearing for the central claim, as the low model accuracies (max 42.9%) and assertions of high difficulty, multi-hop structure, and absence of shortcuts or English leakage rest directly on the effectiveness of this protocol.

    Authors: We agree that the current description of the two-stage quality control protocol in Sections 3.2 and 3.3 is primarily qualitative and that quantitative metrics would provide stronger substantiation for our claims of high difficulty, multi-hop structure, answer uniqueness, and lack of shortcuts or English leakage. In the revised manuscript, we will add the following: (1) inter-annotator agreement scores (e.g., Cohen's kappa or percentage agreement) from the verification stage; (2) statistics on the filtering process, including the number of initial candidate questions, the proportion discarded at each stage for insufficient difficulty or non-uniqueness, and the specific criteria applied (such as answer verifiability and multi-hop requirements); and (3) any available error rates related to answer uniqueness checks. These additions will directly support the central claims and the reported performance ceiling of 42.9%. revision: yes
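
For readers unfamiliar with the agreement statistic named in the rebuttal, the following is a minimal sketch of Cohen's kappa for two annotators. The "keep"/"drop" labels and the four sample decisions are invented for illustration; no annotation data from the paper is used or implied.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two annotators deciding whether to keep four candidate questions.
annotator_1 = ["keep", "keep", "drop", "drop"]
annotator_2 = ["keep", "drop", "drop", "drop"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5, kappa 0.5
```

Raw percentage agreement here is 75%, but kappa is 0.5 once chance agreement is removed, which is why reporting kappa alongside percentage agreement is more informative.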

Circularity Check

0 steps flagged

No significant circularity


This is a benchmark construction and evaluation paper with no claimed mathematical derivation chain, equations, fitted parameters, or predictions. Questions are reverse-engineered from verifiable answers and filtered via a two-stage QC protocol (Sections 3.2–3.3), but this process is described procedurally without reducing to self-definition or self-citation. Model accuracies are measured empirically on external systems against the released 289-question set; no load-bearing self-citation, uniqueness theorem, or ansatz is invoked to justify the central claims. The work is self-contained against external benchmarks and reports.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the 289 questions are both difficult and uniquely answerable. No free parameters are introduced. No new entities are postulated.

axioms (1)
  • domain assumption: Web pages in Chinese can be accessed and parsed by the tested agents in the same manner as English pages.
    Implicit in the benchmark design and evaluation protocol.

pith-pipeline@v0.9.0 · 5624 in / 1117 out tokens · 22360 ms · 2026-05-17T22:00:29.433365+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  2. GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    A new workflow for multilingual agent benchmark adaptation using functional, cultural, and difficulty alignments improves non-English agent success rates by up to 32.7% over simple machine translation, indicating subs...

  3. LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.

  4. DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

  5. GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

    cs.CL 2026-04 unverdicted novelty 6.0

    GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

  6. MARCA: A Checklist-Based Benchmark for Multilingual Web Search

    cs.CL 2026-04 accept novelty 6.0

    MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.

  7. Towards Knowledgeable Deep Research: Framework and Benchmark

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

  8. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  9. From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

    cs.LG 2026-03 unverdicted novelty 6.0

    A category theory framework evaluates deep research agents on structural skills and shows frontier systems reach only 19.9% accuracy on a new 296-question bilingual benchmark, with theory-guided interventions improvin...

  10. MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    cs.CL 2025-11 unverdicted novelty 6.0

    MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

  11. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    cs.CL 2025-06 conditional novelty 6.0

    DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

  12. Mind DeepResearch Technical Report

    cs.AI 2026-04 unverdicted novelty 5.0

    MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

  13. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  14. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

  15. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

  16. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  17. Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

    cs.AI 2026-04 unverdicted novelty 4.0

    A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
