Recognition: 2 theorem links · Lean Theorem
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Pith reviewed 2026-05-16 08:04 UTC · model grok-4.3
The pith
DeepResearch Bench supplies 100 PhD-level tasks across 22 fields, plus two evaluation methods for deep research agents that align with human judgment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepResearch Bench consists of 100 PhD-level research tasks, each crafted by domain experts across 22 distinct fields. The benchmark addresses evaluation complexity through two novel methodologies: a reference-based method with adaptive criteria that assesses the quality of generated research reports, and a citation-based framework that evaluates information retrieval and collection by measuring effective citation count and overall citation accuracy.
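The citation-based framework reduces, at heart, to two numbers per report: how many citations actually support the claims they are attached to, and what fraction of all citations do. Below is a minimal Python sketch under that reading; the `Citation` dataclass and `citation_metrics` helper are illustrative names, not the released evaluation code, and the support judgment (human or LLM judge) is taken as given.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    url: str
    claim: str        # the statement in the report that cites this source
    supported: bool   # judged (by a human or an LLM judge) as backed by the source

def citation_metrics(citations: list[Citation]) -> dict[str, float]:
    """Effective citation count = citations whose claims the source actually supports;
    citation accuracy = that count divided by all citations made in the report."""
    effective = sum(1 for c in citations if c.supported)
    total = len(citations)
    accuracy = effective / total if total else 0.0
    return {"effective_citation_count": effective, "citation_accuracy": accuracy}

# Hypothetical report with 4 citations, 3 of which support their claims.
report = [
    Citation("https://example.org/a", "GDP grew 2.1% in 2023", True),
    Citation("https://example.org/b", "Vacancy rates doubled", True),
    Citation("https://example.org/c", "Remote work peaked in 2020", False),
    Citation("https://example.org/d", "Office leases shortened", True),
]
print(citation_metrics(report))  # {'effective_citation_count': 3, 'citation_accuracy': 0.75}
```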
What carries the argument
The central objects are the set of 100 expert-authored tasks together with the dual evaluation frameworks: the reference-based method adapts its scoring criteria to each task's report, and the citation-based method quantifies retrieval success through counts of accurate, relevant citations.
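From the evaluation prompts included with the paper, the reference-based score is a task-weighted sum over four dimensions, each rated 0-10. A minimal sketch of that aggregation follows; the per-task generation of criteria and weights (done by an LLM judge in the actual framework) is abstracted into plain dictionaries with made-up values.

```python
# Dimension names follow the paper's evaluation prompts; weights are task-adaptive
# and must sum to 1, and each dimension score lies on a continuous 0-10 scale.
DIMENSIONS = ("comprehensiveness", "insight", "instruction_following", "readability")

def total_score(dimension_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over the four dimensions."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights must sum to 1"
    return sum(dimension_scores[d] * weights[d] for d in DIMENSIONS)

# Hypothetical weights for an analysis-heavy task, and 0-10 scores produced by
# judging a generated report against task-specific criteria.
weights = {"comprehensiveness": 0.3, "insight": 0.35, "instruction_following": 0.2, "readability": 0.15}
scores  = {"comprehensiveness": 7.5, "insight": 6.0, "instruction_following": 8.0, "readability": 9.0}
print(total_score(scores, weights))  # 7.3
```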
If this is right
- Agents can be ranked and compared on identical, expert-defined multistep research problems rather than ad-hoc queries.
- Report quality receives consistent scoring through adaptive reference criteria instead of subjective review alone.
- Retrieval performance is isolated and measured via citation count and accuracy, separating collection skill from synthesis skill.
- Open release of the tasks and evaluation code allows any group to run the same tests and track incremental gains.
- The benchmark spans 22 fields, enabling assessment of whether agent capabilities generalize beyond narrow domains.
Where Pith is reading between the lines
- Developers may begin optimizing agents specifically against the benchmark scores, which could accelerate capability gains on full research workflows.
- The citation-focused evaluation might transfer to measuring source fidelity in other long-form generation systems such as literature reviews or policy briefs.
- Wider use could encourage creation of similar workflow-level benchmarks in adjacent areas like data analysis pipelines or experimental design.
- Gaps revealed by the tasks could highlight specific planning or synthesis weaknesses that current training regimes overlook.
Load-bearing premise
The 100 tasks created by experts across 22 fields represent genuine deep-research challenges and the two evaluation methods align with human judgment without introducing systematic bias or undisclosed tuning.
What would settle it
Independent human raters scoring the same set of agent-generated reports produce results that diverge markedly from the scores returned by either proposed methodology, or the tasks fail to capture the structure of actual open-ended research problems encountered by PhD-level researchers.
read the original abstract
Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports--compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The other framework is introduced to evaluate DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepResearch Bench, a new benchmark comprising 100 PhD-level research tasks crafted by domain experts across 22 fields, to evaluate LLM-based Deep Research Agents (DRAs) that perform multistep web exploration, retrieval, and synthesis into citation-rich reports. It proposes two evaluation methodologies: (1) a reference-based method using adaptive criteria to score generated reports, and (2) a citation-count/accuracy framework to assess retrieval capabilities, both claimed to achieve strong alignment with human judgment. The benchmark and key components are open-sourced.
Significance. If the human-alignment claims for the two methodologies hold after proper validation, the benchmark would address a clear gap in standardized evaluation of complex autonomous research agents, enabling reproducible comparisons and accelerating progress. The open-sourcing of tasks and evaluation components is a concrete strength that supports immediate community use and extension.
major comments (3)
- [§4] §4 (Evaluation Methodologies): The central claim that both the reference-based adaptive-criteria method and the citation-count/accuracy framework 'achieve strong alignment with human judgment' is unsupported; the manuscript provides no correlation coefficients, inter-rater agreement statistics (e.g., Cohen's kappa or ICC), ablation results on criterion derivation, or details on how adaptive criteria were tuned against human raters. This directly weakens the benchmark's claimed utility.
- [§3] §3 (Benchmark Construction): The assertion that the 100 tasks are representative of real deep-research challenges rests solely on expert crafting across 22 fields, with no reported pilot validation, inter-expert agreement metrics, or comparison against existing research-task corpora to demonstrate coverage or difficulty calibration.
- [§5] §5 (Experiments): The evaluation results on DRAs are presented without baseline comparisons to simpler retrieval-augmented systems or human performance ceilings on the same tasks, making it impossible to interpret the absolute scores or the relative advantage of the proposed methodologies.
minor comments (2)
- [Abstract] The abstract and §1 use the phrase 'strong alignment' without quantifying what threshold (e.g., Pearson r > 0.8) is intended; a brief operational definition would improve clarity.
- [§3] Figure 2 (task distribution) and Table 1 (field coverage) would benefit from explicit counts per field and a note on how task difficulty was calibrated.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each major comment in detail below, and we plan to incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Evaluation Methodologies): The central claim that both the reference-based adaptive-criteria method and the citation-count/accuracy framework 'achieve strong alignment with human judgment' is unsupported; the manuscript provides no correlation coefficients, inter-rater agreement statistics (e.g., Cohen's kappa or ICC), ablation results on criterion derivation, or details on how adaptive criteria were tuned against human raters. This directly weakens the benchmark's claimed utility.
Authors: We agree that the manuscript would benefit from more explicit quantitative evidence of human alignment. While the development process involved iterative tuning against human raters, the submitted version focused on describing the methodologies without including the full set of validation statistics. In the revision, we will expand §4 to include Pearson and Spearman correlation coefficients between the automated scores and human judgments (computed on a held-out set of 20 reports), inter-rater agreement metrics such as Cohen's kappa among human evaluators, and ablation results showing the impact of different criterion derivation approaches. This will provide the necessary support for the alignment claims; a minimal sketch of these statistics appears after the point-by-point responses below. revision: yes
-
Referee: [§3] §3 (Benchmark Construction): The assertion that the 100 tasks are representative of real deep-research challenges rests solely on expert crafting across 22 fields, with no reported pilot validation, inter-expert agreement metrics, or comparison against existing research-task corpora to demonstrate coverage or difficulty calibration.
Authors: The tasks were crafted by PhD-level experts in each field following guidelines to ensure they require deep, multi-step research. We acknowledge the value of additional validation metrics. We will add to §3 a description of the task creation workflow, including a pilot study where a subset of tasks was reviewed by multiple experts for difficulty and relevance, along with inter-expert agreement scores. Additionally, we will include a qualitative comparison to existing benchmarks to demonstrate coverage across research challenges. revision: yes
-
Referee: [§5] §5 (Experiments): The evaluation results on DRAs are presented without baseline comparisons to simpler retrieval-augmented systems or human performance ceilings on the same tasks, making it impossible to interpret the absolute scores or the relative advantage of the proposed methodologies.
Authors: We concur that baselines and human performance references are essential for contextualizing the results. In the revised manuscript, we will augment §5 with evaluations of simpler baselines, such as standard RAG pipelines without agentic planning, and provide human performance estimates on a representative subset of the tasks (where experts completed the tasks under time constraints). This will enable clearer interpretation of the DRA performance and the utility of our evaluation frameworks. revision: yes
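The validation promised in the first response is standard: correlate automated scores with human judgments and report inter-rater agreement among human evaluators. A minimal sketch using scipy and scikit-learn, with made-up placeholder scores (the held-out set of 20 reports and the actual ratings are not available here):

```python
# Sketch of the alignment validation: automated-vs-human correlation plus
# rater agreement. The score arrays below are illustrative placeholders only.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

automated = np.array([7.3, 5.1, 8.2, 6.4, 4.9, 7.8])   # framework scores per report
human_a   = np.array([7.0, 5.5, 8.0, 6.0, 5.2, 8.1])   # human rater A
human_b   = np.array([7.5, 5.0, 7.8, 7.2, 4.8, 8.3])   # human rater B

r, _ = pearsonr(automated, human_a)
rho, _ = spearmanr(automated, human_a)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

# Cohen's kappa needs categories, so discretize the 0-10 scale into bands first.
bands = [0, 4, 7, 10]  # poor / fair / good
kappa = cohen_kappa_score(np.digitize(human_a, bands), np.digitize(human_b, bands))
print(f"Cohen's kappa between human raters = {kappa:.2f}")
```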
Circularity Check
No circularity: benchmark tasks and evaluation methods are externally defined
full rationale
The paper constructs DeepResearch Bench from 100 tasks explicitly crafted by domain experts across 22 fields (external human input) and proposes two evaluation methodologies whose alignment with human judgment is asserted but not derived from any self-fitted parameters, equations, or prior self-citations within the provided text. No self-definitional loops exist (e.g., no metric defined in terms of itself), no fitted inputs are relabeled as predictions, and no uniqueness theorems or ansatzes are smuggled via self-citation. The open-sourcing of components further allows external verification, so the evaluation chain can be checked against independent evidence rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Domain experts can reliably craft representative PhD-level research tasks across 22 fields
- domain assumption: The reference-based adaptive criteria and citation metrics align with human judgment
Forward citations
Cited by 18 Pith papers
-
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
A new framework parses and evaluates citations in LLM deep research reports across link validity, relevance, and factuality, finding 94%+ link success but only 39-77% factual accuracy.
-
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
-
AI scientists produce results without reasoning scientifically
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
-
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
-
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Audit-then-Score evolves factuality benchmarks through verifier-auditor disputes, raising expert accuracy from 60.8% to 90.9% and yielding a new verification agent that outperforms prior methods on deep research reports.
-
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
RubricEM uses rubric-guided stagewise policy decomposition and reflection-based meta-policy evolution to improve long-horizon research agents beyond verifiable rewards.
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
-
MARCA: A Checklist-Based Benchmark for Multilingual Web Search
MARCA is a bilingual benchmark using 52 questions and validated checklists to evaluate LLM web-search completeness and correctness in English and Portuguese.
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
-
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
-
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
-
Mind DeepResearch Technical Report
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
-
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
Citation URLs from LLMs and research agents are hallucinated 3-13% of the time and non-resolving 5-18% of the time, with a released tool that reduces failures by 6-79x.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
A Survey of Context Engineering for Large Language Models
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...
Reference graph
Works this paper leans on
-
[1]
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, August 2024
Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, August 2024. arXiv:2408.07055 [cs]
-
[2]
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, February 2025. arXiv:2410.07095 [cs]
-
[3]
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery, Mar...
-
[4]
deepseek-ai/DeepSeek-V3-0324 · Hugging Face, March 2025
DeepSeek-AI. deepseek-ai/DeepSeek-V3-0324 · Hugging Face, March 2025
-
[5]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
-
[6]
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents, July 2024
Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, and Shuo Shang. Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents, July 2024. arXiv:2407.00993 [cs]
-
[7]
Deep Research is now available on Gemini 2.5 Pro Experimental, April 2025
Google Gemini. Deep Research is now available on Gemini 2.5 Pro Experimental, April 2025
-
[8]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, November 2024. arXiv:2308.00352 [cs]
-
[9]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024. arXiv:2403.07974 [cs]
-
[10]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024. arXiv:2310.06770 [cs]
- [11]
-
[12]
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, and Samuel G. Rodriques. LAB-Bench: Measuring Capabilities of Language Models for Biology Research, July 2024. arXiv:2407.10362 [cs]
-
[13]
WebThinker: Empowering Large Reasoning Models with Deep Research Capability, April 2025
Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. WebThinker: Empowering Large Reasoning Models with Deep Research Capability, April 2025. arXiv:2504.21776 [cs]
-
[14]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, October 2023. arXiv:2308.03688 [cs]
-
[15]
Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey, April 2024. arXiv:2404.11584 [cs]
-
[16]
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, November 2023. arXiv:2311.12983 [cs]
-
[17]
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology, March 2025
Ludovico Mitchener, Jon M. Laurent, Benjamin Tenmann, Siddharth Narayanan, Geemi P. Wellawatte, Andrew White, Lorenzo Sani, and Samuel G. Rodriques. BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology, March 2025. arXiv:2503.00096 [q-bio]
-
[18]
Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G. Lucas. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence, 7(4):592–601, April 2025. Publisher: Nature Publishing Group
- [19]
-
[20]
OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich,...
-
[21]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, ...
-
[22]
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings, January 2025
Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings, January 2025. arXiv:2501.01257 [cs] version: 2
-
[23]
Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, and Kai Chen. HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, September 2024. arXiv:2409.16191 [cs]
-
[24]
Agent Laboratory: Using LLM Agents as Research Assistants, January 2025
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants, January 2025
-
[25]
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models, April 2024
Yijia Shao, Yucheng Jiang, Theodore A. Kanell, Peter Xu, Omar Khattab, and Monica S. Lam. Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models, April 2024. arXiv:2402.14207 [cs]
-
[26]
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents, June 2024
Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, and Shoufa Chen. MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents, June 2024. arXiv:2406.08184 [cs]
-
[27]
A Survey of LLM-based Agents in Medicine: How far are we from Baymax?, February 2025
Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, and Yixuan Yuan. A Survey of LLM-based Agents in Medicine: How far are we from Baymax?, February 2025. arXiv:2502.11211 [cs] version: 1
-
[28]
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents, April 2025. arXiv:2504.12516 [cs]
-
[29]
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation, February 2025
Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, and Luca Soldaini. Organize the Web: Constructing Domains Enhances Pre-Training Data Curation, February 2025. arXiv:2502.10341 [cs]
-
[30]
WebWalker: Benchmarking LLMs in Web Traversal, January 2025
Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. WebWalker: Benchmarking LLMs in Web Traversal, January 2025. arXiv:2501.07572 [cs]
-
[31]
WritingBench: A Comprehensive Benchmark for Generative Writing, March 2025
Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, and Fei Huang. WritingBench: A Comprehensive Benchmark for Generative Writing, March 2025. arXiv:2503.05244 [cs]
-
[32]
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories
Yijia Xiao, Runhui Wang, Luyang Kong, Davor Golac, and Wei Wang. CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolo...
-
[33]
Berkeley Function Calling Leaderboard, February 2024
Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley Function Calling Leaderboard, February 2024. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
-
[34]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, November 2024. arXiv:2405.15793 [cs]
- [35]
-
[36]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, December 2023. arXiv:2306.05685 [cs]
-
[37]
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments, April 2025. arXiv:2504.03160 [cs]
-
[38]
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese, May 2025
Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, and Yining Hua. BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese, May 2025. arXiv:2504.19314 [cs]
-
[39]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents, April 2024. arXiv:2307.13854 [cs]
-
[40]
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Biny...
-
[42-70]
Fragments of the paper's evaluation prompt templates rather than bibliography entries. Recoverable content: generated reports are judged on four dimensions, Comprehensiveness (breadth, depth, and relevance of information coverage), Insight (depth, originality, logic, and value of the analysis and conclusions), Instruction Following (whether the report accurately and completely responds to all task requirements and constraints), and Readability (clarity of structure, fluency of language, and effectiveness of data presentation); dimension weights are allocated dynamically per task and must sum to 1; Total Score = Comprehensiveness × weight + Insight × weight + Instruction Following × weight + Readability × weight; each criterion is scored on a continuous 0-10 scale (0-2 very poor, 2-4 poor, and higher bands for stronger performance), with the judge analyzing each article against each task-specific criterion, in both pairwise-comparison and single-article modes, before assigning scores.