Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Haiyang Shen; Jiuzheng Wang; Taian Guo; Mugeng Liu; Wenchun Jing; Chongyang Pan; Siqi Zhong; Zhiyang Chen; Weichen Bi; Yudong Han; Xiaoying Bai; Yun Ma

arxiv: 2605.21413 · v2 · pith:FJYSECMSnew · submitted 2026-05-20 · 💻 cs.AI

Pith reviewed 2026-05-22 09:39 UTC · model grok-4.3

classification 💻 cs.AI

keywords: AI education · benchmark construction · deep research systems · QuestBench · student evaluation · accountable knowledge work · humanities benchmarks

The pith

Student-designed questions show AI deep research systems pass only 17 percent of expert tasks on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI education benefits from students actively constructing benchmarks to test systems rather than only learning to prompt and use them. Students convert their disciplinary knowledge into verifiable questions, peer-review the designs to remove ambiguity and shortcuts, then run multiple AI systems against the resulting tasks. This produces QuestBench, a set of 256 questions spanning 14 humanities and social-science fields, on which thirteen systems average a 16.85 percent question-level pass rate. The low scores show that fluent, source-backed AI responses can still miss the required query, source, term, or evidence standard. Reflections from five student contributors suggest the exercise teaches them to treat professional knowledge as the standard for judging AI outputs instead of treating AI as a simple retrieval tool.

Core claim

QuestBench is generated when students turn domain knowledge into expert-level questions that peers review for clarity and completeness; evaluation of thirteen deep research systems on the 256 questions yields a mean pass rate of 16.85 percent, with the strongest system, GPT-5.5, reaching 57.58 percent. These results show that even answers backed by sources can fail on precise query formulation, source choice, terminology, or evidence requirements. Reflections from student contributors indicate that constructing and applying the benchmark helps them see disciplinary expertise as the basis for evaluating AI rather than as content that machines simply fetch.
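
To make the headline figures concrete, the sketch below shows one plausible way a mean question-level pass rate could be computed: each system's pass rate is the fraction of questions it passes, and the mean is the unweighted average over systems. This aggregation is an assumption, since the text here does not spell out the exact formula; the system names and pass/fail values are hypothetical toy data, not QuestBench results.

```python
from statistics import mean


def pass_rate(results):
    """Fraction of questions passed, given a list of booleans."""
    return sum(results) / len(results)


def mean_pass_rate(results_by_system):
    """Unweighted mean of per-system question-level pass rates."""
    return mean(pass_rate(r) for r in results_by_system.values())


# Hypothetical toy data: 3 systems, 4 questions each (not QuestBench results).
toy = {
    "system_a": [True, False, False, False],   # 1 of 4 passed
    "system_b": [True, True, False, False],    # 2 of 4 passed
    "system_c": [False, False, False, False],  # 0 of 4 passed
}

per_system = {name: pass_rate(r) for name, r in toy.items()}
overall = mean_pass_rate(toy)
best = max(per_system, key=per_system.get)
print(per_system, overall, best)
```

On the toy data this gives per-system rates of 0.25, 0.50, and 0.00, an overall mean of 0.25, and `system_b` as the best system. A mean weighted by the number of questions each system attempted would be a different, equally defensible choice.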

What carries the argument

The classroom benchmark-construction cycle in which students create, review, and apply expert questions to evaluate AI deep research systems, producing both the QuestBench dataset and direct experience of defining trustworthy answer standards.

If this is right

Fluent AI answers can still fail expert tasks by selecting the wrong query focus, source, term, or evidence level.
Students gain practice specifying what counts as a trustworthy answer when they design and critique the questions.
The activity supplies a reusable classroom format that turns AI evaluation into an educational exercise rather than a black-box use.
Reflections show students come to view their own knowledge as the criterion for assessing machine outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same student-led construction method could be tested in STEM or professional-training settings to map AI failure modes across different knowledge domains.
Collected student questions could serve as additional evaluation or fine-tuning data to improve future deep research systems.
Widespread classroom use might gradually shift AI literacy curricula from usage skills toward explicit training in output judgment.

Load-bearing premise

Student-created and peer-reviewed questions form unbiased expert-level tests whose failures reflect genuine AI limitations rather than flaws in question design or grading criteria.

What would settle it

If the same AI systems were tested on an equivalent set of questions written and reviewed by practicing domain experts and produced markedly higher pass rates, the claim that student benchmarks reliably expose AI shortcomings would be undermined.
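
One way to judge whether a pass-rate gap between a student-written set and an expert-written set is "markedly higher" is a two-proportion z-test. The sketch below is a minimal illustration under strong assumptions: questions are treated as independent trials, the counts are hypothetical, and the function name `two_proportion_z` is introduced here for illustration. It is not a procedure described in the paper.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for the difference in pass rates b - a."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis of equal rates.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 20 of 100 passed on the student set,
# 40 of 100 passed on the expert set.
z, p = two_proportion_z(20, 100, 40, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because the same systems would answer both question sets, a paired or mixed-effects analysis may be more appropriate in practice; the z-test only illustrates the shape of the comparison.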

Figures

Figures reproduced from arXiv: 2605.21413 by Chongyang Pan, Haiyang Shen, Jiuzheng Wang, Mugeng Liu, Siqi Zhong, Taian Guo, Weichen Bi, Wenchun Jing, Xiaoying Bai, Yudong Han, Yun Ma, Zhiyang Chen.

**Figure 1.** Conceptual framework of QUESTBENCH as course-based benchmark construction for teaching accountable AI-mediated knowledge work. Students first encounter deep research systems as a practical tool, then use benchmark construction to design expert-level questions, test shortcuts, validate answers, evaluate models, and analyze failures. The course links tool exposure with question design, disciplinary standards…

**Figure 2.** Course and technical pipeline for QUESTBENCH. Students transform disciplinary knowledge into expert-level question packages, then filter them through preliminary screening, answer verification, grading-criteria audit, anti-shortcut validation, and domain normalization. The same artifacts are then used for model evaluation, scoring, and failure analysis, turning task design into practice in accountable AI-m…

**Figure 3.** Left: Distribution across normalized domain groups. Right: Empirical cross-model question pass-rate…

**Figure 4.** Left: score distributions across models. Right: average tool calls and fraction of runs exceeding 50…

**Figure 5.** Domain and error analysis. Left: mean scores for the largest normalized domains across the thirteen…

Original abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces QuestBench, a benchmark of 256 student-constructed questions across 14 humanities and social-science domains, created via a course activity in which students draft expert-level questions from disciplinary knowledge, perform mutual peer reviews for ambiguity and shortcuts, and then evaluate 13 AI deep-research systems. It reports a mean question-level pass rate of 16.85% (best system GPT-5.5 at 57.58%), argues that these results expose hidden limitations in current systems, and presents the activity as an educational practice for teaching accountable knowledge work. The dataset is released publicly.

Significance. If the questions are shown to be unambiguous and calibrated to expert standards, the work supplies both a reusable benchmark artifact and a concrete classroom method that shifts AI education from tool-use instruction toward critical evaluation of machine-generated knowledge. The low pass rates, if robust, would constitute falsifiable evidence of current limitations in source selection, evidence standards, and query interpretation for non-STEM research tasks.

major comments (2)

[Abstract] Abstract: The headline claim that student-designed tasks reveal genuine AI limitations (mean 16.85% pass rate) is load-bearing on the assumption that the 256 questions are free of ambiguities, shortcuts, and design flaws. The described construction process relies exclusively on student drafting and mutual reviews; no inter-annotator agreement statistics, external domain-expert validation, or quantitative difficulty calibration are reported. This directly affects whether the observed failures can be attributed to the AI systems rather than to question wording or evaluation criteria.
[Evaluation] Evaluation description: The exact operational definition of a 'pass' (e.g., whether partial credit, source citation requirements, or human judgment rubrics are used) is not specified. Without this, the reported rates (including the 57.58% figure for GPT-5.5) cannot be independently verified or compared across systems.
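
The inter-annotator agreement statistic requested in the first major comment could take the form of Cohen's kappa over two reviewers' independent pass/fail judgments of the same answers. The sketch below is a minimal illustration; the function name `cohens_kappa` and the reviewer labels are hypothetical, and the manuscript reports no such data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments from two reviewers on eight answers.
reviewer_1 = [True, True, False, False, True, False, False, False]
reviewer_2 = [True, False, False, False, True, False, False, True]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

Here the reviewers agree on 6 of 8 items (0.75 observed) against 0.53125 expected by chance, so kappa is 7/15, roughly 0.467. Values near 1 would indicate that the grading criteria are applied consistently; values near 0 would suggest the pass/fail decision depends heavily on who grades.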

minor comments (2)

[Abstract] The abstract states 14 domains but provides no list or distribution; adding a brief table or enumeration would improve clarity.
The reflections from five student contributors are referenced but not excerpted; including one or two concrete examples of how benchmark construction changed their view of AI would strengthen the educational argument without lengthening the paper substantially.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address the major concerns regarding the robustness of the benchmark construction and the clarity of the evaluation protocol below. Where appropriate, we will revise the manuscript to incorporate additional details and clarifications.

Referee: [Abstract] Abstract: The headline claim that student-designed tasks reveal genuine AI limitations (mean 16.85% pass rate) is load-bearing on the assumption that the 256 questions are free of ambiguities, shortcuts, and design flaws. The described construction process relies exclusively on student drafting and mutual reviews; no inter-annotator agreement statistics, external domain-expert validation, or quantitative difficulty calibration are reported. This directly affects whether the observed failures can be attributed to the AI systems rather than to question wording or evaluation criteria.

Authors: We acknowledge the importance of demonstrating that the questions are unambiguous and aligned with expert standards. The peer review process was intended to mitigate ambiguities and shortcuts, with students instructed to flag issues in each other's questions. However, we did not collect or report inter-annotator agreement statistics, nor did we obtain validation from external domain experts outside the student cohort. Quantitative difficulty calibration was not performed a priori. These are valid points. In the revised version, we will provide a more detailed account of the peer review guidelines and process, include examples of revised questions, and explicitly discuss these as limitations of the current study, proposing external validation as future work. We maintain that the low pass rates, even if some questions have minor issues, still indicate challenges for AI systems, but we will tone down the claim to reflect the construction method.

Revision: partial

Referee: [Evaluation] Evaluation description: The exact operational definition of a 'pass' (e.g., whether partial credit, source citation requirements, or human judgment rubrics are used) is not specified. Without this, the reported rates (including the 57.58% figure for GPT-5.5) cannot be independently verified or compared across systems.

Authors: We agree that the operational definition of a 'pass' must be clearly specified for reproducibility. We will add a dedicated subsection in the revised manuscript that provides the exact operational definition of a 'pass', including details on the human judgment rubric, source citation requirements, and handling of partial answers. This clarification will enable independent verification and comparison.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on newly constructed benchmark

The paper describes a classroom activity in which students create 256 questions across 14 domains, perform peer reviews for ambiguity and shortcuts, and then run direct evaluations of 13 AI systems, reporting observed pass rates (mean 16.85%, best system 57.58%). No equations, fitted parameters, predictions derived from prior fits, or self-citation chains appear in the provided text. The central claims are straightforward measurements on an externally released dataset; they do not reduce to the inputs by construction or rely on unverified self-referential premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that student-constructed questions validly probe AI capabilities; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Student-designed and peer-reviewed questions can serve as reliable expert-level tests for AI deep research systems.
Invoked when claiming that low pass rates reveal hidden failures rather than question artifacts.

pith-pipeline@v0.9.0 · 5892 in / 1267 out tokens · 55507 ms · 2026-05-22T09:39:12.425645+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 10 internal anchors


work page 2023

[14] [14]

DeepSearchQA: Bridging the comprehensiveness gap for deep research agents.arXiv preprint arXiv:2601.20975, 2026

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, and Dipanjan Das. DeepSearchQA: Bridging the comprehensiveness gap for deep research agents.arXiv preprint arXiv:2601.20975, 2026. URLhttps://arxiv.org/abs/2601.20975

work page arXiv 2026

[15] [15]

PubMedQA: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2567–2577. Association for Computational Linguistics,

work page 2019

[16] [16]

PubMedQA: A Dataset for Biomedical Research Question Answering

doi: 10.18653/v1/D19-1259. URLhttps://aclanthology.org/D19-1259/

work page doi:10.18653/v1/d19-1259

[17] [17]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601–1611. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1147. URL https: //aclantho...

work page doi:10.18653/v1/p17-1147 2017

[18] [18]

ResearchArena: Benchmarking large language models’ ability to collect and organize information as research agents

Hao Kang and Chenyan Xiong. ResearchArena: Benchmarking large language models’ ability to collect and organize information as research agents. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 5653–5671. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.303. URL https://aclanthology. org/202...

work page doi:10.18653/v1/2025.findings-emnlp.303 2025

[19] [19]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y . Charles, H. S. Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026. doi: 10.48550/arXiv.2602.02276. URL https://arxiv.org/abs/ 2602.02276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02276 2026

[20] [20]

In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval- augmented generation. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies...

work page doi:10.18653/v1/2025 2025

[21] [21]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https: //acla...

work page doi:10.1162/tacl_a_00276 2019

[22] [22]

AgentBench: Eval- uating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, et al. AgentBench: Eval- uating LLMs as agents. InThe Twelfth International Conference on Learning Representations. Op...

work page 2024

[23] [23]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines.arXiv preprint arXiv:2502.14739, 2025. doi: 10.48550/arXiv.2502. 14739. URLhttps://arxiv.org/abs/2502.14739

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502 2025

[24] [24]

GAIA: A benchmark for general AI assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. InThe Twelfth International Conference on Learning Representations. OpenReview.net, 2024. URL https://openreview.net/forum? id=fibxvahvs3

work page 2024

[25] [25]

MiniMax M2.7: Early echoes of self-evolution, 2026

MiniMax. MiniMax M2.7: Early echoes of self-evolution, 2026. URLhttps://www.minimax. io/news/minimax-m27-en. Accessed: 2026-05-06

work page 2026

[26] [26]

Kimi K2.6 tech blog: Advancing open-source coding, 2026

Moonshot AI. Kimi K2.6 tech blog: Advancing open-source coding, 2026. URL https: //www.kimi.com/blog/kimi-k2-6. Accessed: 2026-05-06

work page 2026

[27] [27]

GPT-5.5 system card, 2026

OpenAI. GPT-5.5 system card, 2026. URL https://openai.com/index/ gpt-5-5-system-card/. Accessed: 2026-05-06

work page 2026

[28] [28]

Humanity's Last Exam

Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, et al. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649(8099):1139–1146, 2026. doi: 10.1038/ s41586-025-09962-4. URLhttps://doi.org/10.1038/s41586-025-09962-4

work page internal anchor Pith review doi:10.1038/s41586-025-09962-4 2026

[29] [29]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InThe Twelfth International Conference on Learning Representations. OpenReview.net, 2024. URLhttps://openreview.net/forum?id=TGnEMgh4PB

work page 2024

[30] [30]

RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation

Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Shichao Sun, Huanyu Li, Zizhao Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, and Zheng Zhang. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. InAdvances in Neural Information Proce...

work page

[31] [31]

URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/27245589131d17368cccdfa990cbf16e-Paper-Datasets_ and_Benchmarks_Track.pdf

doi: 10.52202/079017-0692. URL https://proceedings.neurips.cc/paper_ files/paper/2024/file/27245589131d17368cccdfa990cbf16e-Paper-Datasets_ and_Benchmarks_Track.pdf

work page doi:10.52202/079017-0692 2024

[32] [32]

PaperBench: Evaluating AI's Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research. arXiv preprint arXiv:2504.01848, 2025. doi: 10.48550/arXiv.2504.01848. URL https: //arxiv.org/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.01848 2025

[33] [33]

Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc V

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation.arXiv preprint arXiv:2310.03214, 2023. URL https://arxiv.org/abs/2310.03214

work page arXiv 2023

[34] [34]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems, volume 3...

work page

[35] [35]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ ad236edc564f3e3156e1b2feafb99a24-Paper-Datasets_and_Benchmarks_Track. pdf. 16

work page 2024

[36] [36]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025. URL https://arxiv.org/abs/2504.12516

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

MiMo-V2.5-Pro, 2026

Xiaomi. MiMo-V2.5-Pro, 2026. URL https://mimo.xiaomi.com/mimo-v2-5-pro . Ac- cessed: 2026-05-06

work page 2026

[38] [38]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018. doi: 10....

work page 2018

[39] [39]

τ-bench: A benchmark for tool-agent-user interaction in real-world domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. InThe Thir- teenth International Conference on Learning Representations. OpenReview.net,

work page

[40] [40]

URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/ 1b126cc38b8638e07bef37e7b2bb72bf-Abstract-Conference.html

work page 2025

[41] [41]

GLM-5.1 model card, 2026

Zhipu AI. GLM-5.1 model card, 2026. URL https://www.modelscope.cn/models/ ZhipuAI/GLM-5.1. Accessed: 2026-05-06

work page 2026

[42] [42]

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, and Yining Hua. BrowseComp-ZH: Benchmarking web browsing ability of large language models in chinese.arXiv preprint arXiv:2504.19314, 2025. doi: 10.48550/arXiv.2504.19314. URLhttps:/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.19314 2025

[43] [43]

counter-guarantee

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations. OpenReview.net, 2024. URL https://openreview.net/forum? id=oK...

work page 2024

[44] [44]

Conduct systematic searches to locate relevant sources

work page

[45] [45]

Visit and analyze authoritative sources carefully

work page

[46] [46]

Synthesize information from multiple sources as needed

work page

[47] [47]

Law”, “Legal Studies

Provide your final answer in <answer></answer> tags Final answer format: <answer>Your answer here</answer> This prompt asks for systematic research without giving domain-specific hints. It matches expert search settings where users have questions but no pre-identified sources. E.3 Scoring Methodology Each question specifies detailed grading criteria with ...

work page