arxiv: 2605.19196 · v1 · pith:7G2OAF3Inew · submitted 2026-05-18 · 💻 cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Pith reviewed 2026-05-20 10:12 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM judges · meta-evaluation · research agents · failure detection · evidence verification · agent evaluation · reliability

The pith

Current LLM judges for evidence-based research agents achieve overall accuracies below 55 percent and perform especially poorly on evidence verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether large language models can serve as reliable judges for the complex outputs of deep research agents that gather evidence, reason step by step, and produce reports. To answer this it builds REFLECT, a benchmark that starts with clean agent traces and inserts specific, localized errors from a detailed taxonomy of reasoning, tool-use, and report-quality failures. Experiments reveal that even the strongest models correctly identify these failures less than 55 percent of the time. This matters because research agents are increasingly used for open-ended information tasks, and untrustworthy automated judges would undermine any attempt to scale their evaluation or oversight.

Core claim

The paper shows that current LLM judges remain unreliable for assessing deep research agents, with even the best-performing models achieving overall accuracies below 55 percent across reasoning, tool-use, and report-quality failures and with especially poor performance on evidence verification.

What carries the argument

REFLECT benchmark, which defines a taxonomy of process- and outcome-level failure modes and creates test cases by applying controlled and localized interventions to quality-screened agent execution traces.
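
A minimal sketch of what a controlled, localized intervention could look like, assuming a trace is simply a list of step records. The `Step` and `PerturbedCase` types, the `inject_failure` helper, the `unsupported_claim` label, and the example trace are all hypothetical illustrations, not the paper's data format or taxonomy.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Step:
    """One step of an agent trace (hypothetical schema)."""
    kind: str      # e.g. "tool_call", "reasoning", "report"
    content: str


@dataclass(frozen=True)
class PerturbedCase:
    """A perturbed trace plus the ground-truth label a judge should recover."""
    steps: List[Step]
    failure_type: str
    failure_index: int


def inject_failure(trace: List[Step], index: int, new_content: str,
                   failure_type: str) -> PerturbedCase:
    """Replace exactly one step's content, leaving every other step untouched."""
    if not 0 <= index < len(trace):
        raise IndexError("intervention index outside the trace")
    perturbed = list(trace)  # shallow copy, so the clean trace is preserved
    perturbed[index] = replace(trace[index], content=new_content)
    return PerturbedCase(perturbed, failure_type, index)


# Made-up clean trace, for illustration only.
clean = [
    Step("tool_call", "search('boiling point of water at sea level')"),
    Step("reasoning", "The source states 100 degrees Celsius."),
    Step("report", "Water boils at 100 degrees Celsius at sea level."),
]
case = inject_failure(
    clean, 2, "Water boils at 90 degrees Celsius at sea level.",
    "unsupported_claim",
)
```

Because only one step changes and its position and type are recorded, the expected judge verdict is known by construction, which is the verifiability property the benchmark description relies on.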

If this is right

  • LLM judges cannot yet be deployed as the sole supervisor for research-agent outputs without additional safeguards.
  • Evidence-verification failures require targeted improvements before judges can be trusted on factual grounding.
  • Evaluation pipelines will need hybrid or multi-judge setups to compensate for individual model weaknesses.
  • Cost-reliability trade-offs must be measured explicitly when choosing which model to use as a judge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-stakes deployment of research agents may need to retain human review loops until judge accuracy improves.
  • The same controlled-intervention approach could be adapted to test judges on other agentic domains such as code generation or data analysis.
  • Fine-tuning judges on the specific failure modes catalogued here offers a concrete next step for raising detection rates.

Load-bearing premise

The controlled and localized interventions on quality-screened agent execution traces produce realistic and verifiable instances of the defined failure modes that match real-world agent behavior.

What would settle it

Human experts independently label the same collection of intervened agent traces for the presence and type of each failure; if their labels disagree substantially with the intervention-derived ground truth, the measured judge accuracies no longer support the unreliability claim.

Figures

Figures reproduced from arXiv: 2605.19196 by Arman Cohan, Asaf Yehudai, Leyao Wang, Michal Shmueli-Scheuer, Peng Chen, Rex Ying, Yanan He, Yixin Liu.

Figure 1: Data distribution of REFLECT across reasoning-process (N = 140), tool-use (N = 132), and outcome-level (N = 200) error types. The outer rings represent the high-level failure dimensions of deep research agents and their corresponding proportions, while the inner rings break each dimension down into fine-grained error types defined by our taxonomy, which is summarized from prior work (see …
Figure 2: Overview of the benchmark construction pipeline of …
Figure 3: Effects of rubric-guided evaluation and chain-of-thought reasoning on perturbation detection …
Figure 4: Failure detection accuracy across process-level and outcome-level perturbation types.
Figure 5: Judge reliability across evaluation settings. (a) Best-of-…
Figure 6: Human annotation interface for perturbation validation. Annotators review the user query, …
Original abstract

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces REFLECT, a meta-evaluation benchmark for LLM judges assessing deep research agents. It defines a taxonomy of process- and outcome-level failure modes in reasoning, tool use, and report quality, then instantiates verifiable instances via controlled, localized interventions on quality-screened agent execution traces. Experiments show that even the strongest LLM judges achieve overall accuracies below 55% on these instances, with notably weak performance on evidence verification, and the work offers guidance on tradeoffs and improved evaluation pipelines.

Significance. If the REFLECT instances prove to be realistic proxies for naturally occurring agent failures, the results would meaningfully advance meta-evaluation practices in agentic NLP by exposing systematic limitations of current LLM judges on open-ended, evidence-grounded tasks. The fine-grained taxonomy and benchmark construction represent a concrete contribution that could inform hybrid or specialized judge designs, though the overall impact hinges on demonstrating that the intervention method preserves the distribution and subtlety of real multi-step agent errors.

major comments (2)
  1. [Abstract and Section 3 (REFLECT construction)] Abstract and method description: The headline finding that LLM judges remain unreliable (overall accuracies below 55%) is load-bearing on the assumption that controlled interventions on screened traces produce realistic and verifiable failure instances. No quantitative check—such as feature overlap statistics, error-rate calibration against organic traces, or human indistinguishability tests—is reported to confirm that the localized edits match the contextual embedding and subtlety of naturally arising failures in multi-step tool use and synthesis.
  2. [Experimental Results] Experimental results: The accuracy claims, including the especially poor performance on evidence verification, are presented without error bars, confidence intervals, or per-category dataset statistics (e.g., instance counts and balance across reasoning/tool/report failures). This omission prevents assessment of whether the sub-55% figures are robust or sensitive to sampling variation.
minor comments (1)
  1. [Taxonomy section] The taxonomy definitions would benefit from additional concrete examples of each failure mode to improve clarity and reproducibility for readers implementing similar interventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract and Section 3 (REFLECT construction)] Abstract and method description: The headline finding that LLM judges remain unreliable (overall accuracies below 55%) is load-bearing on the assumption that controlled interventions on screened traces produce realistic and verifiable failure instances. No quantitative check—such as feature overlap statistics, error-rate calibration against organic traces, or human indistinguishability tests—is reported to confirm that the localized edits match the contextual embedding and subtlety of naturally arising failures in multi-step tool use and synthesis.

    Authors: We agree that the realism of the controlled interventions is central to the benchmark's validity. The construction relies on quality-screened traces and localized edits chosen to maintain contextual coherence while enabling verification. However, the original submission did not include quantitative validation such as feature-overlap statistics or human indistinguishability tests. In the revision we will add a dedicated subsection to Section 3 that reports a human study on a random sample of 100 instances, comparing perceived naturalness and subtlety against unmodified traces, together with basic distributional statistics on intervention types. This will directly address the concern while preserving the verifiability advantage of the method. revision: yes

  2. Referee: [Experimental Results] Experimental results: The accuracy claims, including the especially poor performance on evidence verification, are presented without error bars, confidence intervals, or per-category dataset statistics (e.g., instance counts and balance across reasoning/tool/report failures). This omission prevents assessment of whether the sub-55% figures are robust or sensitive to sampling variation.

    Authors: We accept that the absence of uncertainty estimates and dataset statistics limits interpretability. In the revised manuscript we will augment the experimental section with bootstrap-derived error bars and 95% confidence intervals for all reported accuracies. We will also add a table (and corresponding text) listing the exact instance counts per failure category and the balance across reasoning, tool-use, and report-quality failures. These additions will allow readers to evaluate the robustness of the sub-55% results. revision: yes
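
The bootstrap error bars promised in the second response can be sketched as follows. This is a percentile bootstrap over per-instance correctness flags, written under the assumption that each judged instance yields a 0/1 outcome; the function name, the resample count, and the example outcomes are illustrative choices, not details taken from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(correct: List[int], n_resamples: int = 2000,
                          level: float = 0.95, seed: int = 0
                          ) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if not correct:
        raise ValueError("need at least one judged instance")
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    # Resample instances with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return point, lower, upper


# Made-up outcomes for illustration only: 70 correct verdicts out of 140.
outcomes = [1] * 70 + [0] * 70
acc, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

With category sizes in the low hundreds, intervals of this kind can be several points wide, which is why the referee asks for them before reading much into a sub-55% figure.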

Circularity Check

0 steps flagged

No circularity; empirical benchmark from new interventions on external traces

full rationale

The paper constructs REFLECT by defining a failure-mode taxonomy and applying controlled localized interventions to quality-screened agent execution traces, then measures LLM-judge accuracy on the resulting instances. The headline result (accuracies below 55%) is obtained directly from these experiments rather than from any self-referential equation, fitted parameter renamed as prediction, or load-bearing self-citation chain. No derivation step reduces to its own inputs by construction; the work is self-contained against the newly generated benchmark data.
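
The measurement step described here can be pictured as comparing each judge verdict with the label fixed by the intervention. The sketch below assumes exact-match scoring on a single label per instance; the record layout, category names, and labels are hypothetical placeholders, and the paper's actual scoring protocol may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Score (category, injected_label, judge_label) triples by exact match."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for category, injected, predicted in records:
        totals[category] += 1
        hits[category] += int(predicted == injected)
    if not totals:
        raise ValueError("no records to score")
    scores = {c: hits[c] / totals[c] for c in totals}
    scores["overall"] = sum(hits.values()) / sum(totals.values())
    return scores


# Hypothetical records; labels are placeholders, not the paper's taxonomy.
records = [
    ("reasoning", "logic_gap", "logic_gap"),
    ("reasoning", "logic_gap", "none"),
    ("tool_use", "wrong_query", "wrong_query"),
    ("report", "unsupported_claim", "none"),
]
scores = accuracy_by_category(records)
```

Nothing in this computation feeds the judge's output back into the ground truth, which is the sense in which the check finds no circularity.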

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the introduced taxonomy and intervention technique generate representative failure cases; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption: Controlled and localized interventions on quality-screened agent execution traces yield verifiable, comprehensive, and fine-grained failure instances.
    This premise is invoked to justify the benchmark construction and is required for the accuracy measurements to be meaningful.
invented entities (1)
  • REFLECT benchmark (no independent evidence)
    purpose: Meta-evaluation framework for LLM judges in agentic research settings
    Newly defined in the paper with its taxonomy and intervention procedure.

pith-pipeline@v0.9.0 · 5826 in / 1253 out tokens · 33016 ms · 2026-05-20T10:12:18.128343+00:00 · methodology

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 7 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    S. Agarwal et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Introducing claude haiku 4.5

    Anthropic. Introducing claude haiku 4.5. Anthropic release announcement, 2025

  3. [3]

    Introducing claude opus 4.7

    Anthropic. Introducing claude opus 4.7. Anthropic release announcement, 2026

  4. [4]

    Benchmarking large language mod- els in retrieval-augmented generation

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language mod- els in retrieval-augmented generation. InProceedings of the AAAI Conference on Artificial Intelligence, 2024

  5. [5]

    Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research.arXiv preprint arXiv:2505.19253, 2025

    João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, and Chenyan Xiong. Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research.arXiv preprint arXiv:2505.19253, 2025

  6. [6]

    DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

  7. [7]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=4hturzLcKX. 10

  8. [8]

    RAGAS: Automated evaluation of retrieval augmented generation

    Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval augmented generation. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024

  9. [9]

    Are we on the right way to assessing LLM-as-a-judge?arXiv preprint arXiv:2512.16041, 2025

    Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, and Dongping Chen. Are we on the right way to assessing LLM-as-a-judge?arXiv preprint arXiv:2512.16041, 2025. URL https://arxiv.org/abs/2512.16041

  10. [10]

    Bosse, Jon Evans, Robert G

    FutureSearch, :, Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, and Jack Wildman. Deep research bench: Evaluating ai web research agents, 2025. URLhttps://arxiv.org/abs/2506.06287

  11. [11]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  12. [12]

    Gemma 3 Technical Report

    Gemma Team et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  13. [13]

    Gemini 2.0 is now available to everyone

    Google. Gemini 2.0 is now available to everyone. Google Blog, feb 2025. URL https: //blog.google/innovation-and-ai/models-and-research/google-deepmind/ge mini-model-updates-february-2025/. Accessed: 2026-05-06

  14. [14]

    Gemini 2.5 flash is now in preview

    Google. Gemini 2.5 flash is now in preview. https://blog.google/products-and-p latforms/products/gemini/gemini-2-5-flash-preview/ , April 2025. Accessed: 2026-05-06

  15. [15]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/mod el-cards/gemini-3-1-pro/, February 2026. Accessed: 2026-05-06

  16. [16]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  17. [17]

    DEER: A benchmark for evaluating deep research agents on expert report generation.arXiv preprint arXiv:2512.17776, 2025

    Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, and Honglak Lee. DEER: A benchmark for evaluating deep research agents on expert report generation.arXiv preprint arXiv:2512.17776, 2025

  18. [18]

    Step-DeepResearch technical report.arXiv preprint arXiv:2512.20491, 2025

    Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, et al. Step-DeepResearch technical report.arXiv preprint arXiv:2512.20491, 2025

  19. [19]

    MetaTool benchmark for large language models: Deciding whether to use tools and which to use

    Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, and Lichao Sun. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. InThe Twelfth International Conference on Learning Representations, 2024

  20. [20]

    Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S

    Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D’Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, and Sergey Feldman. Deep research, shallow evaluation: A case study in meta-evaluation for long-form qa benchmarks, 2026. URLhttps://arxiv.org/abs/2603.06942

  21. [21]

    Toolscan: A benchmark for characterizing errors in tool-use LLMs, 2025

    Shirley Kokane, Ming Zhu, Tulika Manoj Awalgaonkar, Jianguo Zhang, Akshara Prabhakar, Thai Quoc Hoang, Zuxin Liu, Rithesh R N, Liangwei Yang, Weiran Yao, Juntao Tan, Zhiwei Liu, Shelby Heinecke, Huan Wang, Juan Carlos Niebles, Caiming Xiong, and Silvio Savarese. Toolscan: A benchmark for characterizing errors in tool-use LLMs, 2025. URL https: //openrevie...

  22. [22]

    Deepwidesearch: Benchmarking depth and width in agentic information seeking.arXiv preprint arXiv:2510.20168, 2025

    Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang. Deepwidesearch: Benchmarking depth and width in agentic information seeking.arXiv preprint arXiv:2510.20168, 2025

  23. [23]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems, 2020. 11

  24. [24]

    ReportBench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025

    Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. ReportBench: Evaluating deep research agents via academic survey tasks.arXiv preprint arXiv:2508.15804, 2025

  25. [25]

    VerifyBench: A systematic benchmark for evaluating reasoning verifiers across domains.arXiv preprint arXiv:2507.09884, 2025

    Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, and Wentao Zhang. VerifyBench: A systematic benchmark for evaluating reasoning verifiers across domains.arXiv preprint arXiv:2507.09884, 2025

  26. [26]

    G- Eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G- Eval: NLG evaluation using GPT-4 with better human alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522. Association for Computational Linguistics, 2023

  27. [27]

    ReIFE: Re-evaluating instruction-following evaluation

    Yixin Liu, Kejian Shi, Alexander Fabbri, Yilun Zhao, PeiFeng Wang, Chien-Sheng Wu, Shafiq Joty, and Arman Cohan. ReIFE: Re-evaluating instruction-following evaluation. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Languag...

  28. [28]

    MMLU-CF: A contamination- free multi-task language understanding benchmark

    Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/ 2025.naacl-long.610. URLhttps://aclanthology.org/2025.naacl-long.610/

  29. [29]

    ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

    Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. InFindings of the Association for Computational Linguistics: NAACL 2025, 2025

  30. [30]

    Pal, and Siva Reddy

    Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Ale- jandra Zambrano, Karolina Sta´nczak, Peter Shaw, Christopher J. Pal, and Siva Reddy. Agen- tRewardBench: Evaluating automatic evaluations of web agent trajectories.arXiv preprint arXiv:2504.08942, 2025. URLhttps://arxiv.org/abs/2504.08942

  31. [31]

    Smith, Hannaneh Hajishirzi, and Nathan Lambert

    Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, and Nathan Lambert. Rewardbench 2: Advancing reward model evaluation. InThe Fourteenth International Conference on Learning Representations, 2026

  32. [32]

    An expert schema for evaluating large language model errors in scholarly question-answering systems

    Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, and Harmanpreet Kaur. An expert schema for evaluating large language model errors in scholarly question-answering systems. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems, 2026

  33. [33]

    FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  34. [34]

    the moon is made of marshmallows

    Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows”. InThe Thirteenth International Conference on Learning Representations, 2025

  35. [35]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, et al. Webgpt: Browser- assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332, 2021

  36. [36]

    RAGTruth: A hallucination corpus for developing trustworthy retrieval- augmented language models

    Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A hallucination corpus for developing trustworthy retrieval- augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  37. [37]

    GPT-5 mini

    OpenAI. GPT-5 mini. https://developers.openai.com/api/docs/models/gpt-5-m ini, August 2025. Model version: gpt-5-mini-2025-08-07

  38. [38]

    Introducing gpt-5.3-codex

    OpenAI. Introducing gpt-5.3-codex. OpenAI release and API documentation, 2026

  39. [39]

    Gpt-5.4 model

    OpenAI. Gpt-5.4 model. OpenAI API documentation, 2026. 12

  40. [40]

    Gonzalez

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models, 2025. URL https://openreview.net/forum ?id=2GmDdhBdDk

  41. [41]

    Direct preference optimization: Your language model is secretly a reward model,

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model,

  42. [42]

    URLhttps://openreview.net/forum?id=HPuSIXJaa9

  43. [43]

    ARES: An automated evaluation framework for retrieval-augmented generation systems

    Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An automated evaluation framework for retrieval-augmented generation systems. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024

  44. [44]

    Localizing and miti- gating errors in long-form question answering

    Rachneet Singh Sachdeva, Yixiao Song, Mohit Iyyer, and Iryna Gurevych. Localizing and miti- gating errors in long-form question answering. InFindings of the Association for Computational Linguistics: ACL 2025, pages 20437–20469, 2025

  45. [45]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=Yacmpz84TH

  46. [46]

    Do LLM agents know how to ground, recover, and assess? a benchmark for epistemic competence in information-seeking agents.arXiv preprint arXiv:2509.22391, 2025

    Jiaqi Shao, Yuxiang Lin, Munish Prasad Lohani, Yufeng Miao, and Bing Luo. Do LLM agents know how to ground, recover, and assess? a benchmark for epistemic competence in information-seeking agents.arXiv preprint arXiv:2509.22391, 2025

  47. [47]

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. Dr tulu: Reinforcement learning with ev...

  48. [48]

    ResearchRubrics : A benchmark of prompts and rubrics for evaluating deep research agents

    Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685, 2025

  49. [49]

    Judgebench: A benchmark for evaluating LLM-based judges

    Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chen- guang Wang, Raluca Popa, and Ion Stoica. Judgebench: A benchmark for evaluating LLM-based judges. InThe Thirteenth International Conference on Learning Representations, 2025

  50. [50]

    Tongyi DeepResearch Technical Report

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701, 2025

  51. [51]

    DeepResearchEval: An automated framework for deep research task construction and agentic evaluation.arXiv preprint arXiv:2601.09688, 2026

    Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, and Lidong Bing. DeepResearchEval: An automated framework for deep research task construction and agentic evaluation.arXiv preprint arXiv:2601.09688, 2026

  52. [52]

    Long-form factuality in large language models

    Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Zixia Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V Le. Long-form factuality in large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=4M9f8VMt2C

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  54. [54]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  55. [55]

    Yao, Y ., Wang, Y ., Zhang, Y ., Lu, Y ., Gu, T., Li, L., Zhao, D., Wu, K., Wang, H., Nie, P., Teng, Y ., and Wang, Y

    Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, and Yingchun Wang. A rigorous benchmark with multidimensional evaluation for deep research agents: From answers to reports.arXiv preprint arXiv:2510.02190, 2025

  56. [56]

    MiroEval: Benchmarking multimodal deep research agents in process and outcome. arXiv preprint arXiv:2603.28407, 2026

    Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, and Lidong Bing. MiroEval: Benchmarking multimodal deep research agents in process and outcome. arXiv preprint arXiv:2603.28407, 2026

  57. [57]

    Researchqa: Evaluating scholarly question answering at scale across 75 fields with survey-mined questions and rubrics

    Li S. Yifei, Allen Chang, Chaitanya Malaviya, and Mark Yatskar. Researchqa: Evaluating scholarly question answering at scale across 75 fields with survey-mined questions and rubrics,

  58. [58]

    URL https://arxiv.org/abs/2509.00496

  59. [59]

    Automatic evaluation of attribution by large language models

    Xiang Yue, Boshi Wang, Ziru Chen, Kai Zhang, Yu Su, and Huan Sun. Automatic evaluation of attribution by large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

  60. [60]

    Evaluating large language models at evaluating instruction following, 2024

    Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large language models at evaluating instruction following, 2024. URL https://arxiv.org/abs/2310.07641

  61. [61]

    Why your deep research agent fails? on hallucination evaluation in full research trajectory. arXiv preprint arXiv:2601.22984, 2026

    Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, and Chao Huang. Why your deep research agent fails? on hallucination evaluation in full research trajectory. arXiv preprint arXiv:2601.22984, 2026

  62. [62]

    SRR-Judge: Step-level rating and refinement for enhancing search-integrated reasoning in search agents

    Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang, Wei Han, and Yong Liu. SRR-Judge: Step-level rating and refinement for enhancing search-integrated reasoning in search agents. arXiv preprint arXiv:2602.07773, 2026

  63. [63]

    LongCite: Enabling LLMs to generate fine-grained citations in long-context qa. arXiv preprint arXiv:2409.02897, 2024

    Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, and Juanzi Li. LongCite: Enabling LLMs to generate fine-grained citations in long-context qa. arXiv preprint arXiv:2409.02897, 2024

  64. [64]

    Chaining the evidence: Robust reinforcement learning for deep search agents with citation-aware rubric rewards. arXiv preprint arXiv:2601.06021, 2026

    Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, and Juanzi Li. Chaining the evidence: Robust reinforcement learning for deep search agents with citation-aware rubric rewards. arXiv preprint arXiv:2601.06021, 2026

  65. [65]

    ToolBeHonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models

    Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, and Hayato Yamana. ToolBeHonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  66. [66]

    Judging LLM-as-a-judge with MT-Bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, 2023

  67. [67]

    ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario. arXiv preprint arXiv:2501.10132, 2025

    Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario, 2025. URL https://arxiv.org/abs/2501.10132

  68. [68]

    Evaluating judges as evaluators: The JETTS benchmark of LLM-as-judges as test-time scaling evaluators

    Yilun Zhou, Austin Xu, PeiFeng Wang, Caiming Xiong, and Shafiq Joty. Evaluating judges as evaluators: The JETTS benchmark of LLM-as-judges as test-time scaling evaluators. In Forty-second International Conference on Machine Learning, 2025.

  69. [69]

    Where llm agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025

    Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, and Jiaxuan You. Where llm agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025

  70. [70]

    Agent-as-a-judge: Evaluate agents with agents

    Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedi...

  71. [71]

    The Business Research Company. Global Video Editing Software Market Overview 2025

  72. [72]

    Mordor Intelligence. Video Editing Market Size, Share and Growth Research

  73. [73]

    Straits Research. Video Editing Software Market Size, Share & Growth

  74. [74]

    DataIntelo. Video Editing Service Market Report

  75. [75]

    The Business Research Company. Audio And Video Editing Software Market 2025

  76. [76]

    Virtue Market Research. AI Video Editing Tools Market | Size, Share, Growth | 2025–2030

  77. [77]

    Triple A Review. Video Editing Statistics You Need to Know in 2025

  78. [78]

    SendShort. Video Editing Software Market Statistics (2025)

  79. [79]

    PCMag. The Best Video Editing Software We’ve Tested

  80. [80]

    DIY Video Editor. Best Video Editing Software 2025 Reviewed and Compared
