Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Pith reviewed 2026-05-20 10:12 UTC · model grok-4.3
The pith
Current LLM judges for evidence-based research agents achieve overall accuracies below 55 percent and perform especially poorly on evidence verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that current LLM judges remain unreliable for assessing deep research agents, with even the best-performing models achieving overall accuracies below 55 percent across reasoning, tool-use, and report-quality failures and with especially poor performance on evidence verification.
What carries the argument
The REFLECT benchmark defines a taxonomy of process- and outcome-level failure modes and creates test cases by applying controlled, localized interventions to quality-screened agent execution traces.
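A minimal sketch of what a controlled, localized intervention could look like, assuming a trace is simply an ordered list of steps. The `Step` type, the `intervene` helper, and the failure-mode name `unsupported_claim` are hypothetical illustrations, not the paper's actual data model or taxonomy labels.

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass
class Step:
    kind: str      # hypothetical step types: "tool_call", "reasoning", "report"
    content: str


def intervene(trace, index, new_content, failure_mode):
    """Return a corrupted copy of the trace and a ground-truth label.

    Only the step at `index` is changed, so the injected failure is
    localized and its position is known by construction.
    """
    if not 0 <= index < len(trace):
        raise IndexError("intervention index outside the trace")
    corrupted = deepcopy(trace)  # leave the screened original untouched
    corrupted[index] = Step(corrupted[index].kind, new_content)
    label = {"failure_mode": failure_mode, "step": index}
    return corrupted, label


# Made-up example trace, for illustration only.
clean = [
    Step("tool_call", "search('boiling point of water at sea level')"),
    Step("reasoning", "The source states 100 degrees Celsius."),
    Step("report", "Water boils at 100 degrees Celsius at sea level."),
]
corrupted, label = intervene(
    clean, 2, "Water boils at 90 degrees Celsius at sea level.", "unsupported_claim"
)
```

Because the edit site and failure type are recorded at injection time, a judge's verdict on `corrupted` can be scored against `label` without a separate annotation pass.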
If this is right
- LLM judges cannot yet be deployed as the sole supervisor for research-agent outputs without additional safeguards.
- Evidence-verification failures require targeted improvements before judges can be trusted on factual grounding.
- Evaluation pipelines will need hybrid or multi-judge setups to compensate for individual model weaknesses.
- Cost-reliability trade-offs must be measured explicitly when choosing which model to use as a judge.
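One possible reading of a multi-judge setup is a simple majority vote with escalation on ties. The sketch below assumes each judge returns a string verdict; the function name and the `"escalate"` sentinel are illustrative choices, and voting is only one of several ways to combine judges. It does not help when judges share the same blind spot.

```python
from collections import Counter


def majority_vote(verdicts):
    """Combine per-judge verdicts; return "escalate" when the top two tie."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    top_label, top_count = ranked[0]
    # A tie between the two most frequent verdicts is handed to a human.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "escalate"
    return top_label
```

For example, `majority_vote(["fail", "pass", "fail"])` returns `"fail"`, while `majority_vote(["fail", "pass"])` returns `"escalate"`.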
Where Pith is reading between the lines
- High-stakes deployment of research agents may need to retain human review loops until judge accuracy improves.
- The same controlled-intervention approach could be adapted to test judges on other agentic domains such as code generation or data analysis.
- Fine-tuning judges on the specific failure modes catalogued here offers a concrete next step for raising detection rates.
Load-bearing premise
The controlled and localized interventions on quality-screened agent execution traces produce realistic and verifiable instances of the defined failure modes that match real-world agent behavior.
What would settle it
Human experts independently label the same collection of intervened agent traces for the presence and type of each failure; if their labels disagree substantially with the LLM judge outputs, the reliability claim is falsified.
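The degree of agreement between expert labels and judge outputs can be quantified with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is an assumption made here for illustration; the page does not say which agreement measure would be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates close agreement and a value near 0 indicates agreement no better than chance; where to draw the line for "substantial" disagreement is a judgment call that would need to be fixed in advance.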
Original abstract
Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces REFLECT, a meta-evaluation benchmark for LLM judges assessing deep research agents. It defines a taxonomy of process- and outcome-level failure modes in reasoning, tool use, and report quality, then instantiates verifiable instances via controlled, localized interventions on quality-screened agent execution traces. Experiments show that even the strongest LLM judges achieve overall accuracies below 55% on these instances, with notably weak performance on evidence verification, and the work offers guidance on tradeoffs and improved evaluation pipelines.
Significance. If the REFLECT instances prove to be realistic proxies for naturally occurring agent failures, the results would meaningfully advance meta-evaluation practices in agentic NLP by exposing systematic limitations of current LLM judges on open-ended, evidence-grounded tasks. The fine-grained taxonomy and benchmark construction represent a concrete contribution that could inform hybrid or specialized judge designs, though the overall impact hinges on demonstrating that the intervention method preserves the distribution and subtlety of real multi-step agent errors.
major comments (2)
- [Abstract and Section 3 (REFLECT construction)] Abstract and method description: The headline finding that LLM judges remain unreliable (overall accuracies below 55%) is load-bearing on the assumption that controlled interventions on screened traces produce realistic and verifiable failure instances. No quantitative check—such as feature overlap statistics, error-rate calibration against organic traces, or human indistinguishability tests—is reported to confirm that the localized edits match the contextual embedding and subtlety of naturally arising failures in multi-step tool use and synthesis.
- [Experimental Results] Experimental results: The accuracy claims, including the especially poor performance on evidence verification, are presented without error bars, confidence intervals, or per-category dataset statistics (e.g., instance counts and balance across reasoning/tool/report failures). This omission prevents assessment of whether the sub-55% figures are robust or sensitive to sampling variation.
minor comments (1)
- [Taxonomy section] The taxonomy definitions would benefit from additional concrete examples of each failure mode to improve clarity and reproducibility for readers implementing similar interventions.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.
-
Referee: [Abstract and Section 3 (REFLECT construction)] Abstract and method description: The headline finding that LLM judges remain unreliable (overall accuracies below 55%) is load-bearing on the assumption that controlled interventions on screened traces produce realistic and verifiable failure instances. No quantitative check—such as feature overlap statistics, error-rate calibration against organic traces, or human indistinguishability tests—is reported to confirm that the localized edits match the contextual embedding and subtlety of naturally arising failures in multi-step tool use and synthesis.
Authors: We agree that the realism of the controlled interventions is central to the benchmark's validity. The construction relies on quality-screened traces and localized edits chosen to maintain contextual coherence while enabling verification. However, the original submission did not include quantitative validation such as feature-overlap statistics or human indistinguishability tests. In the revision we will add a dedicated subsection to Section 3 that reports a human study on a random sample of 100 instances, comparing perceived naturalness and subtlety against unmodified traces, together with basic distributional statistics on intervention types. This will directly address the concern while preserving the verifiability advantage of the method.
Revision: yes
-
Referee: [Experimental Results] Experimental results: The accuracy claims, including the especially poor performance on evidence verification, are presented without error bars, confidence intervals, or per-category dataset statistics (e.g., instance counts and balance across reasoning/tool/report failures). This omission prevents assessment of whether the sub-55% figures are robust or sensitive to sampling variation.
Authors: We accept that the absence of uncertainty estimates and dataset statistics limits interpretability. In the revised manuscript we will augment the experimental section with bootstrap-derived error bars and 95% confidence intervals for all reported accuracies. We will also add a table (and corresponding text) listing the exact instance counts per failure category and the balance across reasoning, tool-use, and report-quality failures. These additions will allow readers to evaluate the robustness of the sub-55% results.
Revision: yes
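For readers unfamiliar with bootstrap-derived intervals, the following is a minimal sketch of a percentile bootstrap for a judge's accuracy. It assumes per-instance correctness is available as a list of 0/1 values; the function name, the resample count, and the example data are illustrative and are not taken from the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds 1 where the judge was right and 0 where it was wrong.
    """
    if not correct:
        raise ValueError("need at least one instance")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    # Resample instances with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


# Made-up outcomes: 54 correct verdicts out of 100 instances.
outcomes = [1] * 54 + [0] * 46
accuracy, lower, upper = bootstrap_accuracy_ci(outcomes)
```

With only a hundred instances the interval is wide, which is why the per-category instance counts requested by the referee matter for interpreting any single accuracy figure.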
Circularity Check
No circularity; empirical benchmark from new interventions on external traces
The paper constructs REFLECT by defining a failure-mode taxonomy and applying controlled localized interventions to quality-screened agent execution traces, then measures LLM-judge accuracy on the resulting instances. The headline result (accuracies below 55%) is obtained directly from these experiments rather than from any self-referential equation, fitted parameter renamed as prediction, or load-bearing self-citation chain. No derivation step reduces to its own inputs by construction; the work is self-contained against the newly generated benchmark data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Controlled and localized interventions on quality-screened agent execution traces yield verifiable, comprehensive, and fine-grained failure instances.
invented entities (1)
- REFLECT benchmark: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "controlled and localized interventions on quality-screened agent execution traces... yields verifiable, comprehensive, and fine-grained instances"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "taxonomy of process- and outcome-level failure modes... controlled intervention"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.