Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma

arxiv: 2605.21413 · v1 · pith:FJYSECMSnew · submitted 2026-05-20 · 💻 cs.AI

Pith reviewed 2026-05-21 04:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AI education · benchmark construction · QuestBench · deep research systems · accountable knowledge work · student-designed questions · AI evaluation · humanities and social sciences


The pith

Building benchmarks in class teaches students to judge AI and shows that current deep research systems fail most expert-level questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that AI education should move beyond teaching students to use AI tools and instead include constructing benchmarks to test AI outputs. In this course practice, students draw on their disciplinary knowledge to create verifiable expert-level questions, review one another's designs to remove ambiguity and shortcuts, and then evaluate multiple AI systems on the resulting tasks. The produced QuestBench contains 256 questions across 14 humanities and social-science domains. Evaluation of thirteen systems yields a mean pass rate of only 16.85 percent, with the top system reaching 57.58 percent, and the failures highlight cases where fluent answers still miss the right query, source, term, or evidence standard. Student reflections indicate that the activity helps them view professional knowledge as the basis for judging AI rather than content AI simply retrieves.

Core claim

Students transform their disciplinary expertise into 256 verifiable questions across 14 domains to form QuestBench. After peer review to remove ambiguity and shortcuts, evaluation of thirteen AI deep research systems shows a mean question-level pass rate of 16.85 percent, with GPT-5.5 achieving the highest at 57.58 percent. Failures often involve missing the precise query, appropriate source, key term, or evidence standard despite fluent and sourced answers. Student reflections indicate that constructing the benchmark shifts their view of knowledge from something AI retrieves to the foundation for judging AI outputs.

What carries the argument

The course-based benchmark construction process, where students design expert-level questions, conduct peer review for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks.

Load-bearing premise

That student peer review sufficiently removes ambiguity and shortcuts from the questions so that AI failures reflect actual system limitations rather than question flaws.

What would settle it

Independent domain experts reviewing all 256 QuestBench questions and confirming they are clear and solvable by skilled humans, followed by re-testing the same thirteen systems and finding persistently low pass rates, would support the claim of genuine system limitations.

Figures

Figures reproduced from arXiv: 2605.21413 by Chongyang Pan, Haiyang Shen, Jiuzheng Wang, Mugeng Liu, Siqi Zhong, Taian Guo, Weichen Bi, Wenchun Jing, Xiaoying Bai, Yudong Han, Yun Ma, Zhiyang Chen.

**Figure 1.** Conceptual framework of QUESTBENCH as course-based benchmark construction for teaching accountable AI-mediated knowledge work. Students first encounter deep research systems as a practical tool, then use benchmark construction to design expert-level questions, test shortcuts, validate answers, evaluate models, and analyze failures. The course links tool exposure with question design, disciplinary standards…

**Figure 2.** Course and technical pipeline for QUESTBENCH. Students transform disciplinary knowledge into expert-level question packages, then filter them through preliminary screening, answer verification, grading-criteria audit, anti-shortcut validation, and domain normalization. The same artifacts are then used for model evaluation, scoring, and failure analysis, turning task design into practice in accountable AI-m…

**Figure 3.** Left: Distribution across normalized domain groups. Right: Empirical cross-model question pass-rate…

**Figure 4.** Left: score distributions across models. Right: average tool calls and fraction of runs exceeding 50…

**Figure 5.** Domain and error analysis. Left: mean scores for the largest normalized domains across the thirteen…

Original abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a classroom workflow where students build and peer-review expert-level questions to test AI deep-research systems, producing QuestBench and showing low pass rates, but the quality controls on those questions remain lightly documented.


The core idea here is straightforward: instead of just teaching students to use AI tools, have them construct verifiable questions from their own disciplinary knowledge, review each other's work for shortcuts and ambiguity, and then run the resulting tasks against current systems. QuestBench ends up with 256 questions across 14 fields, and the reported numbers are low: a mean pass rate around 17 percent, with the best system at 58 percent. That framing of benchmark construction as an educational activity is the genuinely new piece; prior work on benchmarks exists, but tying it directly to student-led peer review in a course setting for teaching accountability is not standard practice yet. The public dataset on Hugging Face is a clear plus for anyone who wants to test further or replicate the exercise. The reflections from five students also give a sense that the activity shifts how participants think about judging AI outputs rather than just consuming them.

The soft spot is the validation step. The abstract describes student peer review to catch ambiguity and shortcuts, but supplies no numbers on revision rates, inter-rater agreement, how many questions were dropped, or any external expert check. Without those details it is hard to rule out that some of the reported failures come from question wording rather than system limits. The evaluation protocol itself is not fully spelled out either. This is not a fatal problem for an education-focused paper, but it does mean the claim that the tasks reveal “hidden failures” rests on an assumption that needs more evidence.

The work is aimed at instructors in AI, information science, or humanities courses who are looking for concrete ways to move beyond prompt engineering. A reader who wants a ready-made classroom module and a new test set will get immediate value; someone looking for a rigorously validated benchmark will want the full methods section first.

It is coherent on its own terms and shows honest engagement with the teaching problem, so it deserves a serious referee to sort out the quality-control questions and see whether the approach scales.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a course-based practice for teaching AI through benchmark construction, using deep research systems as the target. Students in humanities and social-science domains create 256 verifiable expert-level questions (QuestBench), peer-review one another's designs to remove ambiguity and shortcuts, and then evaluate 13 AI systems on the resulting tasks. The central empirical result is that current systems perform poorly, with a mean question-level pass rate of 16.85% and the best system (GPT-5.5) reaching only 57.58%. The authors argue that the failures are educationally useful, that the activity helps students see professional knowledge as a basis for judging AI outputs, and that QuestBench serves both as a public benchmark artifact and as a reusable classroom setting for accountable knowledge work.

Significance. If the questions are verifiably expert-level and free of design artifacts, the work supplies a concrete, replicable pedagogical intervention that shifts AI education from tool-use to critical evaluation. The public release of the 256-question dataset on Hugging Face is a clear strength for reproducibility and follow-on research. The reported performance gap also offers falsifiable evidence about the current limits of deep research systems on tasks that require precise source selection, terminology, and evidence standards.

major comments (2)

[Benchmark construction and review process] The description of the peer-review process (the account of students reviewing one another's designs for ambiguity and shortcuts) provides no quantitative metrics: no inter-rater agreement scores, no revision or discard rates, no external expert validation, and no breakdown of how many of the 256 questions survived review. This detail is load-bearing for the headline claim that the 16.85% mean and 57.58% best-system pass rates reflect genuine system limitations rather than question-design artifacts.
[Evaluation on QuestBench] The evaluation protocol is described at a high level but does not specify the exact scoring rubric, whether answers were judged by the question authors or independent raters, or how partial credit or source verification was handled. Without these details it is difficult to interpret the reported pass rates or to rule out post-hoc adjustments.
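
As an illustration of the inter-rater agreement statistic requested in the first major comment, the sketch below computes Cohen's kappa for two peer reviewers who each assign a verdict to the same questions. The verdict labels (`keep`, `revise`, `drop`) and the sample data are hypothetical; the paper does not report how review decisions were recorded, so this is one plausible way to quantify agreement, not the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical verdicts.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each reviewer's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical review verdicts for six questions (illustrative only).
reviewer_1 = ["keep", "keep", "revise", "drop", "keep", "revise"]
reviewer_2 = ["keep", "revise", "revise", "drop", "keep", "keep"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most questions receive the same verdict. With more than two reviewers per question, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.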

minor comments (2)

[Introduction] The term 'deep research systems' is used repeatedly but never given an explicit operational definition or list of the thirteen evaluated models beyond the mention of GPT-5.5; a short table or footnote would improve clarity.
[Data availability] The dataset link is provided, but the manuscript does not state whether the released files include the original student questions, the final validated versions, the AI responses, or the scoring keys.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the transparency of our benchmark construction and evaluation protocols. We address each major comment below and will revise the manuscript to incorporate additional details where possible.


Referee: The description of the peer-review process provides no quantitative metrics: no inter-rater agreement scores, no revision or discard rates, no external expert validation, and no breakdown of how many of the 256 questions survived review. This is load-bearing for claims that the pass rates reflect genuine system limitations rather than design artifacts.

Authors: We agree that quantitative metrics on the peer-review process would improve clarity and address potential concerns about question quality. In the revised manuscript, we will add a dedicated subsection on the review workflow. This will include: the total number of questions initially drafted by students, the number revised or discarded during peer review, any available inter-rater agreement statistics from the peer-review rounds, and a note that external expert validation was not performed (relying instead on the disciplinary expertise of the student authors and the verifiable nature of the questions). We maintain that the combination of student expertise, peer scrutiny for ambiguity and shortcuts, and the public release of the dataset supports the headline results, but we acknowledge the value of these additional metrics for reproducibility. revision: yes
Referee: The evaluation protocol is described at a high level but does not specify the exact scoring rubric, whether answers were judged by the question authors or independent raters, or how partial credit or source verification was handled. This makes it difficult to interpret the pass rates or rule out post-hoc adjustments.

Authors: We appreciate this observation and will expand the evaluation section in the revision. The updated text will specify: the binary pass/fail scoring rubric (an answer passes only if it fully satisfies all expert criteria listed in the question, including source selection, terminology, and evidence standards); that judgments were performed by the question authors with cross-checking by at least one peer reviewer; that no partial credit was awarded; and that source verification was required as part of the rubric. We will also include one or two concrete scoring examples to illustrate the process. These clarifications will make the protocol fully reproducible and eliminate ambiguity about how the 16.85% mean and 57.58% best-system rates were derived. revision: yes
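
The binary rubric described in the simulated response above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data layout, the criterion names (`source`, `term`, `evidence`), and the system names are hypothetical, and the code follows only the rule as worded in the response (pass only if every listed criterion is satisfied, no partial credit). It is not the paper's evaluation code.

```python
from statistics import mean


def passes(criteria_met):
    """Binary rubric: pass only if every listed criterion is satisfied."""
    return len(criteria_met) > 0 and all(criteria_met.values())


def pass_rate_per_system(results):
    """Map {system: {question_id: {criterion: bool}}} to {system: pass fraction}."""
    return {
        system: mean(1.0 if passes(criteria) else 0.0 for criteria in answers.values())
        for system, answers in results.items()
    }


# Hypothetical judgments for two systems on two questions (illustrative only).
results = {
    "system_a": {
        "q1": {"source": True, "term": True, "evidence": True},
        "q2": {"source": True, "term": False, "evidence": True},
    },
    "system_b": {
        "q1": {"source": False, "term": True, "evidence": True},
        "q2": {"source": False, "term": False, "evidence": False},
    },
}

rates = pass_rate_per_system(results)
mean_rate = mean(rates.values())          # average pass rate across systems
best_system = max(rates, key=rates.get)   # system with the highest pass rate
print(rates, mean_rate, best_system)
```

When every system is run on every question, averaging per-system pass rates gives the same number as averaging per-question pass rates across systems. If some runs are missing, the two averages can differ, so a revised protocol would need to state which one the reported mean uses.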

Circularity Check

0 steps flagged

No significant circularity: empirical measurements on external systems


The paper constructs QuestBench via student-designed questions and peer review, then reports direct empirical pass rates (mean 16.85%, best system 57.58%) on thirteen external AI systems. No equations, fitted parameters, or first-principles derivations appear; the central claims are measurements against independent models rather than reductions to self-referential inputs or self-citation chains. The evaluation protocol is self-contained against external benchmarks, with no load-bearing steps that equate outputs to inputs by construction. This is the expected honest non-finding for an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions about AI system behavior and educational peer review; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5892 in / 1162 out tokens · 50272 ms · 2026-05-21T04:07:45.345185+00:00 · methodology




Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. InAdvances in Neu- ral Information Processing Systems, volume 36, pages 28091–28114. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 5950bf290a1570ea401bf98882128160-Pa...

work page 2023

[10] [10]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

doi: 10.48550/arXiv.2506.11763. URLhttps://arxiv.org/abs/2506.11763

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.11763

[12] [12]

Gemini 3.1 pro model card, 2026

Google DeepMind. Gemini 3.1 pro model card, 2026. URL https://deepmind.google/ models/model-cards/gemini-3-1-pro/. Accessed: 2026-05-06

work page 2026

[13] [13]

LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models

Neel Guha, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas- Wood, Austin Peters, Brandon Waldon, Daniel Rockmore, et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 36, pages 44123–44279. Curran Associates, I...

work page 2023

[14] [14]

DeepSearchQA: Bridging the comprehensiveness gap for deep research agents.arXiv preprint arXiv:2601.20975, 2026

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, and Dipanjan Das. DeepSearchQA: Bridging the comprehensiveness gap for deep research agents.arXiv preprint arXiv:2601.20975, 2026. URLhttps://arxiv.org/abs/2601.20975

work page arXiv 2026

[15] [15]

PubMedQA: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2567–2577. Association for Computational Linguistics,

work page 2019

[16] [16]

URLhttps://aclanthology.org/D19-1259/

doi: 10.18653/v1/D19-1259. URLhttps://aclanthology.org/D19-1259/

work page doi:10.18653/v1/d19-1259

[17] [17]

Joshi, E

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601–1611. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1147. URL https: //aclantho...

work page doi:10.18653/v1/p17-1147 2017

[18] [18]

ResearchArena: Benchmarking large language models’ ability to collect and organize information as research agents

Hao Kang and Chenyan Xiong. ResearchArena: Benchmarking large language models’ ability to collect and organize information as research agents. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 5653–5671. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.303. URL https://aclanthology. org/202...

work page doi:10.18653/v1/2025.findings-emnlp.303 2025

[19] [19]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y . Charles, H. S. Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026. doi: 10.48550/arXiv.2602.02276. URL https://arxiv.org/abs/ 2602.02276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02276 2026

[20] [20]

ISBN 979-8-89176-256-5

Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval- augmented generation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies...

work page doi:10.18653/v1/2025 2025

[21] [21]

Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:452–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https: //acla...

work page doi:10.1162/tacl_a_00276 2019

[22] [22]

AgentBench: Eval- uating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, et al. AgentBench: Eval- uating LLMs as agents. InThe Twelfth International Conference on Learning Representations. Op...

work page 2024

[23] [23]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025. doi: 10.48550/arXiv.2502. 14739. URLhttps://arxiv.org/abs/2502.14739

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502 2025

[24] [24]

GAIA: A benchmark for general AI assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations. OpenReview.net, 2024. URL https://openreview.net/forum? id=fibxvahvs3

work page 2024

[25] [25]

MiniMax M2.7: Early echoes of self-evolution, 2026

MiniMax. MiniMax M2.7: Early echoes of self-evolution, 2026. URLhttps://www.minimax. io/news/minimax-m27-en. Accessed: 2026-05-06

work page 2026

[26] [26]

Kimi K2.6 tech blog: Advancing open-source coding, 2026

Moonshot AI. Kimi K2.6 tech blog: Advancing open-source coding, 2026. URL https: //www.kimi.com/blog/kimi-k2-6. Accessed: 2026-05-06

work page 2026

[27] [27]

GPT-5.5 system card, 2026

OpenAI. GPT-5.5 system card, 2026. URL https://openai.com/index/ gpt-5-5-system-card/. Accessed: 2026-05-06

work page 2026

[28] [28]

Humanity's Last Exam

Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, et al. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649(8099):1139–1146, 2026. doi: 10.1038/ s41586-025-09962-4. URLhttps://doi.org/10.1038/s41586-025-09962-4

work page internal anchor Pith review doi:10.1038/s41586-025-09962-4 2026

[29] [29]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InThe Twelfth International Conference on Learning Representations. OpenReview.net, 2024. URLhttps://openreview.net/forum?id=TGnEMgh4PB

work page 2024

[30] [30]

RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation

Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Shichao Sun, Huanyu Li, Zizhao Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, and Zheng Zhang. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. InAdvances in Neural Information Proce...

work page

[31] [31]

URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/27245589131d17368cccdfa990cbf16e-Paper-Datasets_ and_Benchmarks_Track.pdf

doi: 10.52202/079017-0692. URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/27245589131d17368cccdfa990cbf16e-Paper-Datasets_ and_Benchmarks_Track.pdf

work page doi:10.52202/079017-0692 2024

[32] [32]

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research. arXiv preprint arXiv:2504.01848, 2025. doi: 10.48550/arXiv.2504.01848. URL https: //arxiv.org/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.01848 2025

[33] [33]

FreshLLMs: Refreshing large language models with search engine augmentation.arXiv preprint arXiv:2310.03214, 2023

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation.arXiv preprint arXiv:2310.03214, 2023. URL https://arxiv.org/abs/2310.03214

work page arXiv 2023

[34] [34]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems, volume 3...

work page

[35] [35]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ ad236edc564f3e3156e1b2feafb99a24-Paper-Datasets_and_Benchmarks_Track. pdf. 16

work page 2024

[36] [36]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025. URL https://arxiv.org/abs/2504.12516

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

MiMo-V2.5-Pro, 2026

Xiaomi. MiMo-V2.5-Pro, 2026. URL https://mimo.xiaomi.com/mimo-v2-5-pro . Ac- cessed: 2026-05-06

work page 2026

[38] [38]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018. doi: 10....

work page 2018

[39] [39]

τ-bench: A benchmark for tool-agent-user interaction in real-world domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. InThe Thir- teenth International Conference on Learning Representations. OpenReview.net,

work page

[40] [40]

URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/ 1b126cc38b8638e07bef37e7b2bb72bf-Abstract-Conference.html

work page 2025

[41] [41]

GLM-5.1 model card, 2026

Zhipu AI. GLM-5.1 model card, 2026. URL https://www.modelscope.cn/models/ ZhipuAI/GLM-5.1. Accessed: 2026-05-06

work page 2026

[42] [42]

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, and Yining Hua. BrowseComp-ZH: Benchmarking web browsing ability of large language models in chinese.arXiv preprint arXiv:2504.19314, 2025. doi: 10.48550/arXiv.2504.19314. URLhttps:/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.19314 2025

[43] [43]

counter-guarantee

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations. OpenReview.net, 2024. URL https://openreview.net/forum? id=oK...

work page 2024

[44] [44]

Conduct systematic searches to locate relevant sources

work page

[45] [45]

Visit and analyze authoritative sources carefully

work page

[46] [46]

Synthesize information from multiple sources as needed

work page

[47] [47]

Law”, “Legal Studies

Provide your final answer in <answer></answer> tags Final answer format: <answer>Your answer here</answer> This prompt asks for systematic research without giving domain-specific hints. It matches expert search settings where users have questions but no pre-identified sources. E.3 Scoring Methodology Each question specifies detailed grading criteria with ...

work page