Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
Pith reviewed 2026-05-21 04:07 UTC · model grok-4.3
The pith
Building benchmarks in class teaches students to judge AI and shows that current deep research systems fail most expert-level questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Students transform their disciplinary expertise into 256 verifiable questions across 14 domains to form QuestBench. After peer review to remove ambiguity and shortcuts, evaluation of thirteen AI deep research systems shows a mean question-level pass rate of 16.85 percent, with GPT-5.5 achieving the highest at 57.58 percent. Failures often involve missing the precise query, appropriate source, key term, or evidence standard despite fluent and sourced answers. Student reflections indicate that constructing the benchmark shifts their view of knowledge from something AI retrieves to the foundation for judging AI outputs.
What carries the argument
The course-based benchmark construction process, where students design expert-level questions, conduct peer review for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks.
Load-bearing premise
That student peer review sufficiently removes ambiguity and shortcuts from the questions so that AI failures reflect actual system limitations rather than question flaws.
What would settle it
Independent domain experts reviewing all 256 QuestBench questions and confirming they are clear and solvable by skilled humans, followed by re-testing the same thirteen systems and finding persistently low pass rates, would support the claim of genuine system limitations.
Original abstract
As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a course-based practice for teaching AI through benchmark construction, using deep research systems as the target. Students in humanities and social-science domains create 256 verifiable expert-level questions (QuestBench), peer-review one another's designs to remove ambiguity and shortcuts, and then evaluate 13 AI systems on the resulting tasks. The central empirical result is that current systems perform poorly, with a mean question-level pass rate of 16.85% and the best system (GPT-5.5) reaching only 57.58%. The authors argue that the failures are educationally useful, that the activity helps students see professional knowledge as a basis for judging AI outputs, and that QuestBench serves both as a public benchmark artifact and as a reusable classroom setting for accountable knowledge work.
Significance. If the questions are verifiably expert-level and free of design artifacts, the work supplies a concrete, replicable pedagogical intervention that shifts AI education from tool-use to critical evaluation. The public release of the 256-question dataset on Hugging Face is a clear strength for reproducibility and follow-on research. The reported performance gap also offers falsifiable evidence about the current limits of deep research systems on tasks that require precise source selection, terminology, and evidence standards.
major comments (2)
- [Benchmark construction and review process] The description of the peer-review process (the account of students reviewing one another's designs for ambiguity and shortcuts) provides no quantitative metrics: no inter-rater agreement scores, no revision or discard rates, no external expert validation, and no breakdown of how many of the 256 questions survived review. This detail is load-bearing for the headline claim that the 16.85% mean and 57.58% best-system pass rates reflect genuine system limitations rather than question-design artifacts.
- [Evaluation on QuestBench] The evaluation protocol is described at a high level but does not specify the exact scoring rubric, whether answers were judged by the question authors or independent raters, or how partial credit or source verification was handled. Without these details it is difficult to interpret the reported pass rates or to rule out post-hoc adjustments.
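The inter-rater agreement scores requested in the first major comment could be reported with a standard statistic such as Cohen's kappa. The following is a minimal sketch, not the authors' procedure: the verdict labels ("keep", "revise") and the two reviewer lists are hypothetical, and the manuscript does not say that review verdicts were recorded in this form.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two peer reviewers on six draft questions.
reviewer_1 = ["keep", "keep", "revise", "keep", "revise", "keep"]
reviewer_2 = ["keep", "revise", "revise", "keep", "revise", "keep"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most questions receive the same verdict.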
minor comments (2)
- [Introduction] The term 'deep research systems' is used repeatedly but never given an explicit operational definition or list of the thirteen evaluated models beyond the mention of GPT-5.5; a short table or footnote would improve clarity.
- [Data availability] The dataset link is provided, but the manuscript does not state whether the released files include the original student questions, the final validated versions, the AI responses, or the scoring keys.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to strengthen the transparency of our benchmark construction and evaluation protocols. We address each major comment below and will revise the manuscript to incorporate additional details where possible.
- Referee: The description of the peer-review process provides no quantitative metrics: no inter-rater agreement scores, no revision or discard rates, no external expert validation, and no breakdown of how many of the 256 questions survived review. This is load-bearing for claims that the pass rates reflect genuine system limitations rather than design artifacts.
  Authors: We agree that quantitative metrics on the peer-review process would improve clarity and address potential concerns about question quality. In the revised manuscript, we will add a dedicated subsection on the review workflow. This will include: the total number of questions initially drafted by students, the number revised or discarded during peer review, any available inter-rater agreement statistics from the peer-review rounds, and a note that external expert validation was not performed (relying instead on the disciplinary expertise of the student authors and the verifiable nature of the questions). We maintain that the combination of student expertise, peer scrutiny for ambiguity and shortcuts, and the public release of the dataset supports the headline results, but we acknowledge the value of these additional metrics for reproducibility.
  Revision: yes
- Referee: The evaluation protocol is described at a high level but does not specify the exact scoring rubric, whether answers were judged by the question authors or independent raters, or how partial credit or source verification was handled. This makes it difficult to interpret the pass rates or rule out post-hoc adjustments.
  Authors: We appreciate this observation and will expand the evaluation section in the revision. The updated text will specify: the binary pass/fail scoring rubric (an answer passes only if it fully satisfies all expert criteria listed in the question, including source selection, terminology, and evidence standards); that judgments were performed by the question authors with cross-checking by at least one peer reviewer; that no partial credit was awarded; and that source verification was required as part of the rubric. We will also include one or two concrete scoring examples to illustrate the process. These clarifications will make the protocol fully reproducible and eliminate ambiguity about how the 16.85% mean and 57.58% best-system rates were derived.
  Revision: yes
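The binary rubric described in the second response (pass only if every criterion is met, no partial credit) can be sketched as follows. This is an illustration under stated assumptions, not the authors' scoring code: the criterion names, the system names, and the judgments are hypothetical, and averaging per-system pass rates is only one possible reading of the "mean question-level pass rate".

```python
def question_passes(criteria_met):
    """Binary rubric: pass only if every listed criterion is satisfied."""
    return bool(criteria_met) and all(criteria_met.values())


def pass_rate(judgments):
    """Percentage of questions that one system passes."""
    passed = sum(question_passes(j) for j in judgments)
    return 100.0 * passed / len(judgments)


# Hypothetical judgments: one dict per question, one flag per criterion.
judgments_by_system = {
    "system_a": [
        {"source": True, "terminology": True, "evidence": True},
        {"source": True, "terminology": False, "evidence": True},
        {"source": False, "terminology": False, "evidence": False},
        {"source": True, "terminology": True, "evidence": True},
    ],
    "system_b": [
        {"source": True, "terminology": True, "evidence": False},
        {"source": True, "terminology": True, "evidence": True},
        {"source": False, "terminology": True, "evidence": True},
        {"source": False, "terminology": False, "evidence": True},
    ],
}

rates = {name: pass_rate(j) for name, j in judgments_by_system.items()}
# Assumption: the mean is taken over systems (a macro-average).
mean_rate = sum(rates.values()) / len(rates)
print(rates, mean_rate)
```

Because a single missed criterion fails the whole question, a fluent answer with one wrong source scores the same as an empty one. That is consistent with the no-partial-credit policy stated in the rebuttal.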
Circularity Check
No significant circularity: empirical measurements on external systems
The paper constructs QuestBench via student-designed questions and peer review, then reports direct empirical pass rates (mean 16.85%, best system 57.58%) on thirteen external AI systems. No equations, fitted parameters, or first-principles derivations appear; the central claims are measurements against independent models rather than reductions to self-referential inputs or self-citation chains. The evaluation protocol is self-contained against external benchmarks, with no load-bearing steps that equate outputs to inputs by construction. This is the expected honest non-finding for an empirical benchmark paper.
[44]
Conduct systematic searches to locate relevant sources
-
[45]
Visit and analyze authoritative sources carefully
-
[46]
Synthesize information from multiple sources as needed
-
[47]
Provide your final answer in <answer></answer> tags Final answer format: <answer>Your answer here</answer> This prompt asks for systematic research without giving domain-specific hints. It matches expert search settings where users have questions but no pre-identified sources. E.3 Scoring Methodology Each question specifies detailed grading criteria with ...
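Since the prompt asks systems to wrap the final answer in <answer></answer> tags, a grader would first need to pull that span out of each response. The sketch below shows one way to do this. The function name and the choice to take the last tagged span are assumptions for illustration, and the excerpt does not describe how the authors parsed responses.

```python
import re

# Match the tag pair across line breaks, ignoring tag case.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_final_answer(response):
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None  # the system did not follow the required format
    return matches[-1].strip()


# Hypothetical response text.
example = "I checked two archives.\n<answer> 1848 </answer>"
print(extract_final_answer(example))
```

Returning None for a missing tag keeps format failures distinguishable from wrong answers, which may matter when interpreting low pass rates.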