Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
Pith reviewed 2026-05-22 09:39 UTC · model grok-4.3
The pith
Student-designed questions show AI deep research systems pass only 17 percent of expert tasks on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QuestBench is built by students who turn domain knowledge into expert-level questions, which peers then review for clarity and completeness. Evaluating thirteen deep research systems on the 256 questions yields a mean pass rate of 16.85 percent, with the strongest system, GPT-5.5, reaching 57.58 percent. These results show that even source-backed answers can fail on precise query formulation, source choice, terminology, or evidence requirements. Reflections from student contributors indicate that constructing and applying the benchmark helps them see disciplinary expertise as the basis for evaluating AI rather than as content that machines simply fetch.
What carries the argument
The classroom benchmark-construction cycle in which students create, review, and apply expert questions to evaluate AI deep research systems, producing both the QuestBench dataset and direct experience of defining trustworthy answer standards.
If this is right
- Fluent AI answers can still fail expert tasks by selecting the wrong query focus, source, term, or evidence level.
- Students gain practice specifying what counts as a trustworthy answer when they design and critique the questions.
- The activity supplies a reusable classroom format that turns AI evaluation into an educational exercise rather than black-box tool use.
- Reflections show students come to view their own knowledge as the criterion for assessing machine outputs.
Where Pith is reading between the lines
- The same student-led construction method could be tested in STEM or professional-training settings to map AI failure modes across different knowledge domains.
- Collected student questions could serve as additional evaluation or fine-tuning data to improve future deep research systems.
- Widespread classroom use might gradually shift AI literacy curricula from usage skills toward explicit training in output judgment.
Load-bearing premise
Student-created and peer-reviewed questions form unbiased expert-level tests whose failures reflect genuine AI limitations rather than flaws in question design or grading criteria.
What would settle it
If the same AI systems were tested on an equivalent set of questions written and reviewed by practicing domain experts and produced markedly higher pass rates, the claim that student benchmarks reliably expose AI shortcomings would be undermined.
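One way to make "markedly higher" concrete is a one-sided two-proportion z-test on pass counts from the two question sets. The sketch below is illustrative only: the function name `two_proportion_z` and the counts are hypothetical, not figures from the paper, and the test treats every graded answer as independent, which is a simplification when the same questions are reused across systems.

```python
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """One-sided z-test for whether set B's pass rate exceeds set A's."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pass rates are all 0 or all 1; the test is undefined")
    z = (p_b - p_a) / math.sqrt(variance)
    # Upper-tail probability of the standard normal distribution.
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# Hypothetical counts: 43 of 256 passes on student-written questions
# versus 90 of 256 passes on an expert-written set.
z, p_value = two_proportion_z(43, 256, 90, 256)
print(f"z = {z:.2f}, one-sided p = {p_value:.2g}")
```

A large positive `z` with a small `p_value` would count against the premise that the student-written questions measure the same difficulty as expert-written ones.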
Original abstract
As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces QuestBench, a benchmark of 256 student-constructed questions across 14 humanities and social-science domains, created via a course activity in which students draft expert-level questions from disciplinary knowledge, perform mutual peer reviews for ambiguity and shortcuts, and then evaluate 13 AI deep-research systems. It reports a mean question-level pass rate of 16.85% (best system GPT-5.5 at 57.58%), argues that these results expose hidden limitations in current systems, and presents the activity as an educational practice for teaching accountable knowledge work. The dataset is released publicly.
Significance. If the questions are shown to be unambiguous and calibrated to expert standards, the work supplies both a reusable benchmark artifact and a concrete classroom method that shifts AI education from tool-use instruction toward critical evaluation of machine-generated knowledge. The low pass rates, if robust, would constitute falsifiable evidence of current limitations in source selection, evidence standards, and query interpretation for non-STEM research tasks.
major comments (2)
- [Abstract] Abstract: The headline claim that student-designed tasks reveal genuine AI limitations (mean 16.85% pass rate) is load-bearing on the assumption that the 256 questions are free of ambiguities, shortcuts, and design flaws. The described construction process relies exclusively on student drafting and mutual reviews; no inter-annotator agreement statistics, external domain-expert validation, or quantitative difficulty calibration are reported. This directly affects whether the observed failures can be attributed to the AI systems rather than to question wording or evaluation criteria.
- [Evaluation] Evaluation description: The exact operational definition of a 'pass' (e.g., whether partial credit, source citation requirements, or human judgment rubrics are used) is not specified. Without this, the reported rates (including the 57.58% figure for GPT-5.5) cannot be independently verified or compared across systems.
minor comments (2)
- [Abstract] The abstract states 14 domains but provides no list or distribution; adding a brief table or enumeration would improve clarity.
- The reflections from five student contributors are referenced but not excerpted; including one or two concrete examples of how benchmark construction changed their view of AI would strengthen the educational argument without lengthening the paper substantially.
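The inter-annotator agreement statistic requested in the first major comment could be computed roughly as follows. This is a minimal sketch of Cohen's kappa for two graders; the function name and the pass/fail labels are hypothetical and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two graders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each grader's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from two graders on six answers.
grader_1 = ["pass", "fail", "fail", "pass", "fail", "fail"]
grader_2 = ["pass", "fail", "pass", "pass", "fail", "fail"]
kappa = cohens_kappa(grader_1, grader_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most verdicts are "fail" and two graders would agree often even if they graded at random.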
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address the major concerns regarding the robustness of the benchmark construction and the clarity of the evaluation protocol below. Where appropriate, we will revise the manuscript to incorporate additional details and clarifications.
Referee: [Abstract] Abstract: The headline claim that student-designed tasks reveal genuine AI limitations (mean 16.85% pass rate) is load-bearing on the assumption that the 256 questions are free of ambiguities, shortcuts, and design flaws. The described construction process relies exclusively on student drafting and mutual reviews; no inter-annotator agreement statistics, external domain-expert validation, or quantitative difficulty calibration are reported. This directly affects whether the observed failures can be attributed to the AI systems rather than to question wording or evaluation criteria.
Authors: We acknowledge the importance of demonstrating that the questions are unambiguous and aligned with expert standards. The peer review process was intended to mitigate ambiguities and shortcuts, with students instructed to flag issues in each other's questions. However, we did not collect or report inter-annotator agreement statistics, nor did we obtain validation from external domain experts outside the student cohort. Quantitative difficulty calibration was not performed a priori. These are valid points. In the revised version, we will provide a more detailed account of the peer review guidelines and process, include examples of revised questions, and explicitly discuss these as limitations of the current study, proposing external validation as future work. We maintain that the low pass rates, even if some questions have minor issues, still indicate challenges for AI systems, but we will tone down the claim to reflect the construction method.
revision: partial
Referee: [Evaluation] Evaluation description: The exact operational definition of a 'pass' (e.g., whether partial credit, source citation requirements, or human judgment rubrics are used) is not specified. Without this, the reported rates (including the 57.58% figure for GPT-5.5) cannot be independently verified or compared across systems.
Authors: We agree that the operational definition of a 'pass' must be clearly specified for reproducibility. We will add a dedicated subsection in the revised manuscript that provides the exact operational definition of a 'pass', including details on the human judgment rubric, source citation requirements, and handling of partial answers. This clarification will enable independent verification and comparison.
revision: yes
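To show why the definition of a 'pass' matters for the headline numbers, here is a minimal sketch of how a mean question-level pass rate could be aggregated. The strict rule used here (a question passes only if every grading criterion is met) is an assumption for illustration, not the paper's confirmed rubric, and the system names and grades are hypothetical.

```python
from statistics import mean

# Hypothetical grading records: system -> question id -> per-criterion verdicts.
grades = {
    "system_a": {"q1": [True, True], "q2": [True, False], "q3": [False, False]},
    "system_b": {"q1": [True, True], "q2": [True, True], "q3": [False, True]},
}

def question_passes(criteria):
    """Assumed strict rule: pass only if every criterion is satisfied."""
    return len(criteria) > 0 and all(criteria)

def system_pass_rate(per_question):
    """Percentage of questions that one system passes."""
    verdicts = [question_passes(c) for c in per_question.values()]
    return 100.0 * sum(verdicts) / len(verdicts)

def mean_pass_rate(all_grades):
    """Unweighted mean of the per-system pass rates."""
    return mean(system_pass_rate(g) for g in all_grades.values())

rates = {name: system_pass_rate(g) for name, g in grades.items()}
overall = mean_pass_rate(grades)
print(rates, overall)
```

Swapping `all` for a partial-credit rule in `question_passes` would change every reported rate, which is the reproducibility concern the referee raises. The unweighted mean over systems also differs from a pooled rate whenever systems are graded on different numbers of questions.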
Circularity Check
No circularity: direct empirical evaluation on newly constructed benchmark
The paper describes a classroom activity in which students create 256 questions across 14 domains, perform peer reviews for ambiguity and shortcuts, and then run direct evaluations of 13 AI systems, reporting observed pass rates (mean 16.85%, best system 57.58%). No equations, fitted parameters, predictions derived from prior fits, or self-citation chains appear in the provided text. The central claims are straightforward measurements on an externally released dataset; they do not reduce to the inputs by construction or rely on unverified self-referential premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Student-designed and peer-reviewed questions can serve as reliable expert-level tests for AI deep research systems.
[44]
Conduct systematic searches to locate relevant sources
-
[45]
Visit and analyze authoritative sources carefully
-
[46]
Synthesize information from multiple sources as needed
-
[47]
Provide your final answer in <answer></answer> tags Final answer format: <answer>Your answer here</answer> This prompt asks for systematic research without giving domain-specific hints. It matches expert search settings where users have questions but no pre-identified sources. E.3 Scoring Methodology Each question specifies detailed grading criteria with ...