
arxiv: 2605.21413 · v1 · pith:FJYSECMSnew · submitted 2026-05-20 · 💻 cs.AI

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Pith reviewed 2026-05-21 04:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI education · benchmark construction · QuestBench · deep research systems · accountable knowledge work · student-designed questions · AI evaluation · humanities and social sciences

The pith

Building benchmarks in class teaches students to judge AI and shows that current deep research systems fail most expert-level questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that AI education should move beyond teaching students to use AI tools and instead include constructing benchmarks to test AI outputs. In this course practice, students draw on their disciplinary knowledge to create verifiable expert-level questions, review one another's designs to remove ambiguity and shortcuts, and then evaluate multiple AI systems on the resulting tasks. The produced QuestBench contains 256 questions across 14 humanities and social-science domains. Evaluation of thirteen systems yields a mean pass rate of only 16.85 percent, with the top system reaching 57.58 percent, and the failures highlight cases where fluent answers still miss the right query, source, term, or evidence standard. Student reflections indicate that the activity helps them view professional knowledge as the basis for judging AI rather than content AI simply retrieves.

Core claim

Students transform their disciplinary expertise into 256 verifiable questions across 14 domains to form QuestBench. After peer review to remove ambiguity and shortcuts, evaluation of thirteen AI deep research systems shows a mean question-level pass rate of 16.85 percent, with GPT-5.5 achieving the highest at 57.58 percent. Failures often involve missing the precise query, appropriate source, key term, or evidence standard despite fluent and sourced answers. Student reflections indicate that constructing the benchmark shifts their view of knowledge from something AI retrieves to the foundation for judging AI outputs.
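To make the headline metric concrete, the following is a minimal Python sketch of how a mean question-level pass rate and a best-system pass rate could be computed from binary outcomes. The system names, question ids, and outcomes are hypothetical toy data, and the aggregation shown is an assumption, since this summary does not spell out the paper's exact procedure.

```python
from statistics import mean

# Hypothetical pass/fail outcomes: results[system][question_id] is True when
# that system's answer passed. Names and values are illustrative only and are
# not taken from QuestBench.
results = {
    "system_a": {"q1": True, "q2": False, "q3": False, "q4": True},
    "system_b": {"q1": False, "q2": False, "q3": False, "q4": True},
    "system_c": {"q1": False, "q2": False, "q3": False, "q4": False},
}


def system_pass_rates(results):
    """Fraction of questions that each system passed."""
    return {
        name: mean(1.0 if ok else 0.0 for ok in outcomes.values())
        for name, outcomes in results.items()
    }


def question_pass_rates(results):
    """Fraction of systems that passed each question (cross-model pass rate)."""
    question_ids = next(iter(results.values())).keys()
    return {
        qid: mean(1.0 if outcomes[qid] else 0.0 for outcomes in results.values())
        for qid in question_ids
    }


per_system = system_pass_rates(results)
per_question = question_pass_rates(results)

# Assumed aggregation: average the per-question cross-model pass rates.
mean_pass_rate = mean(per_question.values())
best_system = max(per_system, key=per_system.get)

print(f"mean question-level pass rate: {mean_pass_rate:.2%}")
print(f"best system: {best_system} at {per_system[best_system]:.2%}")
```

When every system is run on every question, averaging per-question rates and averaging per-system rates give the same number. The two diverge only if some system-question pairs are missing, which is one reason the exact aggregation matters when interpreting a figure such as 16.85 percent.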

What carries the argument

The course-based benchmark construction process, where students design expert-level questions, conduct peer review for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks.

Load-bearing premise

That student peer review sufficiently removes ambiguity and shortcuts from the questions so that AI failures reflect actual system limitations rather than question flaws.

What would settle it

Independent domain experts reviewing all 256 QuestBench questions and confirming they are clear and solvable by skilled humans, followed by re-testing the same thirteen systems and finding persistently low pass rates, would support the claim of genuine system limitations.

Figures

Figures reproduced from arXiv: 2605.21413 by Chongyang Pan, Haiyang Shen, Jiuzheng Wang, Mugeng Liu, Siqi Zhong, Taian Guo, Weichen Bi, Wenchun Jing, Xiaoying Bai, Yudong Han, Yun Ma, Zhiyang Chen.

Figure 1: Conceptual framework of QUESTBENCH as course-based benchmark construction for teaching accountable AI-mediated knowledge work. Students first encounter deep research systems as a practical tool, then use benchmark construction to design expert-level questions, test shortcuts, validate answers, evaluate models, and analyze failures. The course links tool exposure with question design, disciplinary standards…
Figure 2: Course and technical pipeline for QUESTBENCH. Students transform disciplinary knowledge into expert-level question packages, then filter them through preliminary screening, answer verification, grading-criteria audit, anti-shortcut validation, and domain normalization. The same artifacts are then used for model evaluation, scoring, and failure analysis, turning task design into practice in accountable AI-m…
Figure 3: Left: Distribution across normalized domain groups. Right: Empirical cross-model question pass-rate…
Figure 4: Left: score distributions across models. Right: average tool calls and fraction of runs exceeding 50…
Figure 5: Domain and error analysis. Left: mean scores for the largest normalized domains across the thirteen…
Original abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a course-based practice for teaching AI through benchmark construction, using deep research systems as the target. Students in humanities and social-science domains create 256 verifiable expert-level questions (QuestBench), peer-review one another's designs to remove ambiguity and shortcuts, and then evaluate 13 AI systems on the resulting tasks. The central empirical result is that current systems perform poorly, with a mean question-level pass rate of 16.85% and the best system (GPT-5.5) reaching only 57.58%. The authors argue that the failures are educationally useful, that the activity helps students see professional knowledge as a basis for judging AI outputs, and that QuestBench serves both as a public benchmark artifact and as a reusable classroom setting for accountable knowledge work.

Significance. If the questions are verifiably expert-level and free of design artifacts, the work supplies a concrete, replicable pedagogical intervention that shifts AI education from tool-use to critical evaluation. The public release of the 256-question dataset on Hugging Face is a clear strength for reproducibility and follow-on research. The reported performance gap also offers falsifiable evidence about the current limits of deep research systems on tasks that require precise source selection, terminology, and evidence standards.

major comments (2)
  1. [Benchmark construction and review process] The description of the peer-review process (the account of students reviewing one another's designs for ambiguity and shortcuts) provides no quantitative metrics: no inter-rater agreement scores, no revision or discard rates, no external expert validation, and no breakdown of how many of the 256 questions survived review. This detail is load-bearing for the headline claim that the 16.85% mean and 57.58% best-system pass rates reflect genuine system limitations rather than question-design artifacts.
  2. [Evaluation on QuestBench] The evaluation protocol is described at a high level but does not specify the exact scoring rubric, whether answers were judged by the question authors or independent raters, or how partial credit or source verification was handled. Without these details it is difficult to interpret the reported pass rates or to rule out post-hoc adjustments.
minor comments (2)
  1. [Introduction] The term 'deep research systems' is used repeatedly but never given an explicit operational definition or list of the thirteen evaluated models beyond the mention of GPT-5.5; a short table or footnote would improve clarity.
  2. [Data availability] The dataset link is provided, but the manuscript does not state whether the released files include the original student questions, the final validated versions, the AI responses, or the scoring keys.
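Major comment 1 asks for inter-rater agreement scores from the peer-review rounds. As an illustration of what such a number could look like, here is a minimal Python sketch of Cohen's kappa for two reviewers. The choice of statistic, the label set (keep, revise, discard), and the toy decisions are all assumptions made for this example; neither the paper nor the report names a specific agreement measure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item, so kappa is
        # undefined; this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical review decisions by two student reviewers on six draft questions.
reviewer_1 = ["keep", "keep", "revise", "discard", "keep", "revise"]
reviewer_2 = ["keep", "revise", "revise", "discard", "keep", "keep"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for chance, so two reviewers who both mark most questions as "keep" do not appear more consistent than they are. With more than two reviewers per question, a multi-rater statistic such as Fleiss' kappa would be the usual alternative.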

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the transparency of our benchmark construction and evaluation protocols. We address each major comment below and will revise the manuscript to incorporate additional details where possible.

Point-by-point responses
  1. Referee: The description of the peer-review process provides no quantitative metrics: no inter-rater agreement scores, no revision or discard rates, no external expert validation, and no breakdown of how many of the 256 questions survived review. This is load-bearing for claims that the pass rates reflect genuine system limitations rather than design artifacts.

    Authors: We agree that quantitative metrics on the peer-review process would improve clarity and address potential concerns about question quality. In the revised manuscript, we will add a dedicated subsection on the review workflow. This will include: the total number of questions initially drafted by students, the number revised or discarded during peer review, any available inter-rater agreement statistics from the peer-review rounds, and a note that external expert validation was not performed (relying instead on the disciplinary expertise of the student authors and the verifiable nature of the questions). We maintain that the combination of student expertise, peer scrutiny for ambiguity and shortcuts, and the public release of the dataset supports the headline results, but we acknowledge the value of these additional metrics for reproducibility.
    revision: yes

  2. Referee: The evaluation protocol is described at a high level but does not specify the exact scoring rubric, whether answers were judged by the question authors or independent raters, or how partial credit or source verification was handled. This makes it difficult to interpret the pass rates or rule out post-hoc adjustments.

    Authors: We appreciate this observation and will expand the evaluation section in the revision. The updated text will specify: the binary pass/fail scoring rubric (an answer passes only if it fully satisfies all expert criteria listed in the question, including source selection, terminology, and evidence standards); that judgments were performed by the question authors with cross-checking by at least one peer reviewer; that no partial credit was awarded; and that source verification was required as part of the rubric. We will also include one or two concrete scoring examples to illustrate the process. These clarifications will make the protocol fully reproducible and eliminate ambiguity about how the 16.85% mean and 57.58% best-system rates were derived.
    revision: yes
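The all-or-nothing rule described in the second response can be sketched in a few lines of Python. This is a minimal illustration of a binary rubric with no partial credit, as the simulated rebuttal states it. The class name, function names, and criterion names are hypothetical and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """Outcome of checking one grading criterion for one answer (hypothetical type)."""
    name: str
    satisfied: bool


def passes(criteria):
    """Binary verdict: pass only if there is at least one criterion and all are met."""
    return len(criteria) > 0 and all(c.satisfied for c in criteria)


def unmet(criteria):
    """Names of the criteria an answer failed, useful for failure analysis."""
    return [c.name for c in criteria if not c.satisfied]


# Hypothetical graded answer: fluent and sourced, but short of the evidence standard.
graded = [
    CriterionResult("source_selection", True),
    CriterionResult("terminology", True),
    CriterionResult("evidence_standard", False),
]

verdict = passes(graded)
failed = unmet(graded)
print("pass" if verdict else "fail", failed)
```

Under this rule an answer that meets two of three criteria scores the same as one that meets none, which is consistent with low headline pass rates even when answers look fluent. Recording the unmet criteria alongside the verdict keeps the information needed for the kind of error analysis the paper describes.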

Circularity Check

0 steps flagged

No significant circularity: empirical measurements on external systems

Full rationale

The paper constructs QuestBench via student-designed questions and peer review, then reports direct empirical pass rates (mean 16.85%, best system 57.58%) on thirteen external AI systems. No equations, fitted parameters, or first-principles derivations appear; the central claims are measurements against independent models rather than reductions to self-referential inputs or self-citation chains. The evaluation measures externally developed systems on the student-built tasks, with no load-bearing steps that equate outputs to inputs by construction. This is the expected honest non-finding for an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions about AI system behavior and educational peer review; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5892 in / 1162 out tokens · 50272 ms · 2026-05-21T04:07:45.345185+00:00 · methodology

