DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu

arxiv: 2604.14683 · v1 · submitted 2026-04-16 · 💻 cs.AI

Pith reviewed 2026-05-10 11:44 UTC · model grok-4.3

classification 💻 cs.AI

keywords deep research agents · evaluation benchmark · multimodal report generation · retrieval robustness · hallucination control · static research sandbox · factual accuracy · multi-agent systems

The pith

DR³-Eval provides a reproducible benchmark using static verifiable sandboxes to evaluate deep research agents on complex multimodal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish a new way to evaluate deep research agents that overcomes the problems of changing web content and unclear task goals. It does this by building the benchmark from real user materials and pairing each task with a fixed set of documents that include useful information, irrelevant distractors, and noise to mimic the open web but allow exact verification of answers. The authors also define a scoring system across five dimensions that matches what humans would judge as good performance. If this benchmark works as described, researchers could reliably compare different agent systems and identify where they break down in gathering facts or avoiding invented details. This matters because deep research agents are meant to handle long, complicated inquiries that current AI tools still struggle with.

Core claim

DR³-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. A multi-dimensional evaluation framework measures Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and it aligns with human judgments. Experiments using a multi-agent system based on multiple state-of-the-art language models show that the benchmark is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control.

What carries the argument

The per-task static research sandbox corpus, which simulates open-web complexity in a fully verifiable manner by including supportive documents, distractors, and noise alongside authentic task materials.
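One way to picture why a fixed corpus is verifiable is to model it as a labeled document set. The sketch below is illustrative only: the names `DocRole`, `SandboxDoc`, `supportive_fraction`, and `retrieved_recall` are hypothetical and are not taken from the paper or its released code. It assumes each document carries one of the three roles named above, so that the share of supportive documents and an agent's retrieval coverage can be computed exactly.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class DocRole(Enum):
    # The three document roles named in the summary (labels are assumptions).
    SUPPORTIVE = "supportive"   # contains information the task needs
    DISTRACTOR = "distractor"   # related to the topic but not useful
    NOISE = "noise"             # unrelated to the task


@dataclass(frozen=True)
class SandboxDoc:
    doc_id: str
    role: DocRole
    text: str


def role_counts(corpus):
    """Count documents per role in one task's static sandbox."""
    return Counter(doc.role for doc in corpus)


def supportive_fraction(corpus):
    """Share of supportive documents, a crude signal-to-noise proxy."""
    if not corpus:
        return 0.0
    return role_counts(corpus)[DocRole.SUPPORTIVE] / len(corpus)


def retrieved_recall(corpus, retrieved_ids):
    """Fraction of supportive documents that the agent retrieved.

    Because the corpus is fixed, this check is exact and repeatable.
    """
    supportive = {d.doc_id for d in corpus if d.role is DocRole.SUPPORTIVE}
    if not supportive:
        return 0.0
    return len(supportive & set(retrieved_ids)) / len(supportive)


# Toy corpus with placeholder text; not data from the benchmark.
corpus = [
    SandboxDoc("s1", DocRole.SUPPORTIVE, "..."),
    SandboxDoc("s2", DocRole.SUPPORTIVE, "..."),
    SandboxDoc("d1", DocRole.DISTRACTOR, "..."),
    SandboxDoc("n1", DocRole.NOISE, "..."),
]
print(supportive_fraction(corpus))             # 0.5
print(retrieved_recall(corpus, ["s1", "d1"]))  # 0.5
```

In this toy run the agent fetched one of two supportive documents plus a distractor, which is the kind of retrieval failure a static sandbox can expose exactly.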

If this is right

Current deep research agents struggle with maintaining retrieval robustness across the benchmark tasks.
These agents have difficulty controlling hallucinations in their generated multimodal reports.
The proposed multi-dimensional evaluation aligns closely with human judgments of report quality.
The benchmark enables reproducible experiments without reliance on dynamic web environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Approaches like this static sandbox could be extended to create benchmarks for agent performance in other knowledge-intensive fields.
Identifying these specific failure modes may guide targeted improvements in agent design for better fact handling.
Reproducible benchmarks of this type could accelerate progress by providing consistent metrics for comparing new agent architectures.

Load-bearing premise

The per-task static research sandbox corpus simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise.

What would settle it

Showing that state-of-the-art models complete the tasks with high scores on all evaluation dimensions and without retrieval or hallucination issues would challenge the claim that the benchmark reveals critical failure modes.
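The falsification condition above can be written as a simple check over the five dimensions. This is a minimal sketch under stated assumptions: scores are taken to be normalized to the range 0 to 1, the `threshold` of 0.9 is an arbitrary placeholder for "high", and the function names are hypothetical. The paper's actual scales and aggregation may differ.

```python
# The five evaluation dimensions named in the core claim.
DIMENSIONS = (
    "information_recall",
    "factual_accuracy",
    "citation_coverage",
    "instruction_following",
    "depth_quality",
)


def weakest_dimension(scores):
    """Return (name, score) for the lowest-scoring dimension."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    name = min(DIMENSIONS, key=lambda d: scores[d])
    return name, scores[name]


def challenges_claim(scores, threshold=0.9):
    """True only if every dimension meets the (hypothetical) threshold.

    A model that passes this check on the whole benchmark would count as
    evidence against the claim that the benchmark is highly challenging.
    """
    return weakest_dimension(scores)[1] >= threshold


# Made-up scores for illustration; these are not results from the paper.
example = {
    "information_recall": 0.72,
    "factual_accuracy": 0.41,
    "citation_coverage": 0.66,
    "instruction_following": 0.93,
    "depth_quality": 0.58,
}
print(weakest_dimension(example))  # ('factual_accuracy', 0.41)
print(challenges_claim(example))   # False
```

Checking the minimum rather than the mean reflects the wording "on all evaluation dimensions": one weak dimension is enough to keep the claim standing.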

Figures

Figures reproduced from arXiv: 2604.14683 by Chengkang Jiang, Fanyu Meng, He Zhu, Jiaheng Liu, Jiakai Wang, Jiayang Mao, Junlan Feng, Qianqian Xie, Qingheng Xiong, Shihao Li, Tiantian Xia, Xueming Han, Yanghai Wang, Yubin Guo, Yuqing Wen, Yuxiang Ren, Zhaohui Wang, Zhiqi Bai, Zijie Zhang.

**Figure 2.** Overview of the DR³-Eval framework. (1) Data construction synthesizes search paths from real-world multimodal files via a divergent-convergent mechanism, establishing a static sandbox with controlled signal-to-noise ratios and backward-derived queries. (2) Our DR³-Agent adopts a hierarchical multi-agent architecture where a perception-enhanced Main Agent coordinates global reasoning while specialized sub…

**Figure 3.** Dataset statistics. (a) Domain coverage spanning Technology, Economy, and Humanities, …

**Figure 4.** Performance of different LLMs across different …

**Figure 5.** Analysis on the effectiveness of sandbox corpus. Analysis on the correlation between sandbox corpus and real-world web corpus. To further verify whether the sandbox corpus can approximate information acquisition in real-world web environments, we conduct experiments with real-time web search on an English subset using Qwen3-235B and Gemini-2.5-Pro. As shown in …

**Figure 6.** Analysis on the performance of different sizes of sandbox corpus.

**Figure 7.** Comparison of framework architectures. Analysis on the effectiveness of sandbox corpus. To verify the reasonableness of our sandbox corpus design, we systematically analyze the impact of different document components on model performance using a sample of 20 tasks. The experiments are mainly based on the 128k-sized corpus (except for the only supportive setting). In …

**Figure 8.** Error type analysis across LLMs. (1) Retrieval Error, denoting where the agent fails to locate or omits key information required to answer the question during the retrieval stage; (2) Reasoning Error, denoting where the agent, despite obtaining relevant information, makes mistakes in information integration, logical inference, or detail processing; and (3) Hallucination, denoting where the model’s gene…

**Figure 9.** Breakdown of specific file formats for documents and images.

**Figure 10.** t-SNE visualization of the semantic distribution in the Sandbox Corpus.

**Figure 11.** The view of user files.

Original abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DR³-Eval supplies a concrete benchmark with static sandboxes and five scoring dimensions, but the abstract leaves the human alignment and experiment details too thin to fully support the main claims.

The main thing to know is that this paper puts forward DR³-Eval as a new benchmark for deep research agents, built around user materials and a per-task static corpus meant to include relevant documents plus distractors and noise. They pair it with a five-part evaluation covering recall, factual accuracy, citations, instruction following, and depth quality, then test it on their own multi-agent system DR³-Agent across several models. The public code and data release is a clear positive for anyone who wants to run or extend the work themselves.

Referee Report

2 major / 1 minor

Summary. The paper proposes DR³-Eval, a benchmark for Deep Research Agents focused on multimodal, multi-file report generation tasks. It pairs authentic user materials with a per-task static research sandbox corpus containing supportive documents, distractors, and noise, intended to simulate open-web complexity while remaining verifiable. A multi-dimensional evaluation framework is introduced measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, with a claimed validation against human judgments. Experiments on the authors' DR³-Agent system using multiple state-of-the-art LLMs are said to show the benchmark is highly challenging and exposes failure modes in retrieval robustness and hallucination control. Code and data are released publicly.

Significance. If the evaluation framework's alignment with human judgments holds and the static sandbox successfully surfaces transferable failure modes, DR³-Eval could offer a valuable reproducible alternative to dynamic web evaluations for long-horizon research agents. The public code and data release is a clear strength for reproducibility.

major comments (2)

[Abstract] Abstract: the claim that the multi-dimensional evaluation framework 'aligns with human judgments' is unsupported, as no details are provided on the human evaluation protocol, number of annotators, inter-annotator agreement, statistical tests, or quantitative alignment results.
[Abstract and sandbox construction] Benchmark description (Abstract and sandbox construction section): the central claim that the per-task static research sandbox 'simulates open-web complexity' while revealing general failure modes rests on an untested assumption; a fixed corpus cannot reproduce live search ranking changes, temporal drift, or iterative reformulation against an evolving index, risking sandbox-specific artifacts rather than transferable limitations.

minor comments (1)

[Abstract] Abstract: expand the DR³ acronym on first use and clarify whether 'multi-file' refers to multiple source documents or output files.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] Abstract: the claim that the multi-dimensional evaluation framework 'aligns with human judgments' is unsupported, as no details are provided on the human evaluation protocol, number of annotators, inter-annotator agreement, statistical tests, or quantitative alignment results.

Authors: We agree that the abstract would benefit from additional context to make this claim self-contained. The main body of the manuscript provides a dedicated description of the human evaluation protocol along with the associated quantitative alignment results. To address the concern, we will revise the abstract to include a concise reference to the validation approach and key findings, directing readers to the relevant section for full details.
Revision: yes
Referee: [Abstract and sandbox construction] Benchmark description (Abstract and sandbox construction section): the central claim that the per-task static research sandbox 'simulates open-web complexity' while revealing general failure modes rests on an untested assumption; a fixed corpus cannot reproduce live search ranking changes, temporal drift, or iterative reformulation against an evolving index, risking sandbox-specific artifacts rather than transferable limitations.

Authors: We acknowledge the inherent limitations of any static sandbox in fully replicating dynamic web behaviors such as ranking fluctuations, temporal changes, or iterative query reformulation against a live index. Our design prioritizes verifiability and reproducibility, which are necessary for a benchmark that supports consistent evaluation across research efforts. The corpus incorporates supportive documents, distractors, and noise to approximate open-web complexity, and the reported experiments highlight failure modes in retrieval and hallucination control. We will add an explicit discussion of these design trade-offs and the potential for sandbox-specific artifacts in the limitations section of the revised manuscript.
Revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark and agent presented as independent artifacts

The paper introduces DR³-Eval as a new benchmark constructed from authentic user materials paired with a per-task static sandbox, along with a multi-dimensional evaluation framework and a multi-agent system DR³-Agent. No equations, derivations, or predictions appear in the provided text. The sandbox is explicitly described as an independent construction that remains verifiable, with public code and data released. Experiments demonstrate challenges on this benchmark but do not reduce any claimed result to a fitted parameter or self-referential definition. Self-citations, if present, are not load-bearing for the central claims, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete; the main unstated premise is that human judgment is the appropriate external validator for report quality.

axioms (1)

Domain assumption: Human judgments constitute a reliable and stable ground truth for measuring report quality dimensions such as depth and factual accuracy.
The paper states that the framework is validated against human judgments but provides no further justification or alternative validation method.

invented entities (1)

DR³-Eval benchmark with static sandbox corpus (no independent evidence)
Purpose: To enable realistic yet reproducible evaluation of deep research agents on multimodal report generation.
Newly introduced in the paper; no external independent evidence of its effectiveness is supplied in the abstract.

[1] [1]

China’s high-speed rail network is dense, especially in the east

work page

[2] [2]

The map shows the extensive network as of November 2023

work page 2023

[3] [3]

The network includes lines with speeds of 300 km/h or more

work page

[4] [4]

Rail lines are color-coded by speed, from<200 to≥300 km/h

work page

[5] [5]

Map of Japan’s Shinkansen lines as of March 2025

work page 2025

[6] [6]

Shows operational, planned, and under-construction routes

work page

[7] [7]

A future Linear Ch ¯u¯o Shinkansen (maglev) line is projected

work page

[8] [8]

The network connects major cities like Tokyo, Osaka, and Hakata

work page

[9] [9]

Developed from non-existent to world-class in just over 10 years

work page

[10] [10]

Current trains travel at world-leading speeds of 300-350 km/h

work page

[11] [11]

The new CR450 EMU prototype is the world’s fastest

work page

[12] [12]

CRH380A reaching up to 380 km/h

CR450 prototype reaches 450 km/h in tests. 20 Table 9: Evaluation of Information Recall from User Files. Number Status Evidence 1 Covered The network analysis reveals dense connectivity in eastern and central regions, with key routes connecting major cities... 2 Half Covered The map shows a well-developed network ... as of November 27, 2009, with continue...

work page 2009

[13] [13]

Reducing aerodynamic resistance is crucial for faster trains

work page

[14] [14]

Shinkansen’s strengths are efficiency and passenger comfort

work page

[15] [15]

China has an ambitious 2035 high-speed rail expansion plan

work page 2035

[16] [16]

Digital transformation is key to future rail network evolution

work page

[17] [17]

Future rail relies on IoT, 5G, and AI technologies

work page

[18] [18]

planning

China plans to extend its HSR network to Southeast Asia. F.2 Citation Coverage Table 11: Evaluation of Citation Coverage. No. Source Title Status Web Page Coverage 1 Japan’s Shinkansen: How Does It Stack Up Worldwide?Cited 2 The global rail transportation market was valued at US$ 724,180 million in 2022 and, by 2029, is pro Cited (Continued on next page) ...

work page 2022

[19] [19]

Concise: Query must be SHORT (50-100 words), like a real user’s brief question, not verbose 2.Natural: Query should be from user’s perspective, like a real person would ask

work page

[20] [20]

relevant keywords

Guiding: Query topic should naturally lead agent to search “relevant keywords", but don’t over-hint

work page

[21] [21]

No Exposure: Don’t directly use technical terms from keywords, use simple natural expressions

work page

[22] [22]

based on my xxx file

Brief File Reference: Query must briefly mention user files, like “based on my xxx file" or “see attachment"

work page

[23] [23]

Cover All Results: Query must be designed so ALL len(useful_search) search results are needed for a complete answer, even if each result is only used a little

work page

[24] [24]

Use All Files: Query must be designed so ALL len(user_file_names) user files are needed for a complete answer, even if each file is only used a little Design Approach

work page

[25] [25]

Analyze the common theme of relevant keywords

work page

[26] [26]

Design a SHORT natural query (50-100 words), don’t over-describe background

work page

[27] [27]

Three-distance method spatial layout modern pocket park design cases

Query should: • Be short and direct, like a casual question • Not contain technical jargon or hint-like words • Briefly mention user files ExamplesIf relevant keywords are: - “Three-distance method spatial layout modern pocket park design cases" - “Scattered perspective step-by-step scenery urban micro-renewal" User file is: - Suzhou_Garden_Design.pdf ✗BA...

work page

[28] [30]

Machine learning requires large amounts of data

Atomicity: Each insight must be atomic, containing only 1-12 words, expressing a simple fact or concept Examples of Common Knowledge (DO NOT Extract) • “Machine learning requires large amounts of data"→This is common knowledge • “User experience is important"→This is common knowledge • “This method improved accuracy"→Too vague, no specific value or compar...

work page 2024

[29] [31]

proposes AfME em- bedding

Source Contribution: Extract the main contribution of each source to answering the query, such as: • Methods/techniques/concepts introduced by the source (e.g., “proposes AfME em- bedding", “uses MCMC optimization") • Core topics or problems discussed by the source • Key conclusions or findings of the source • Note: No need to extract precise numbers (e.g...

work page

[30] [32]

Verifiability: Can determine whether the report mentions this information (semantic similarity is sufficient, exact match not required)

work page

[31] [33]

Machine learning requires large amounts of data

Atomicity: Each insight must be atomic, containing only 1-12 words, expressing a simple fact or concept Examples of Common Knowledge (DO NOT Extract) • “Machine learning requires large amounts of data"→This is common knowledge 28 • “User experience is important"→This is common knowledge • “This method improved accuracy"→Too vague, no specific value or com...

work page 2024

[32] [34]

Analyze aspects A, B, and C

Atomic Decomposition: Break down complex requirements into minimal, independent checkpoints • Each requirement checks only one specific point • Example: “Analyze aspects A, B, and C" → Split into “Mention A", “Mention B", “Mention C" • Example: “Compare X and Y" → Split into “Describe X", “Describe Y", “Explain differences" 2.Short and Clear: Each require...

work page 2023

[33] [35]

Only part of the core meaning is covered (missing key details)

work page

[34] [36]

The topic is mentioned but specifics are absent

work page

[35] [37]

Related concept exists but not the exact point

work page

[36] [38]

Generalization without the specific insight

work page

[37] [39]

Shanghai’s garbage classification coverage rate will reach 95% by 2023

The connection requires inference (not explicit) Examples of 0.5: • Insight: “Shanghai’s garbage classification coverage rate will reach 95% by 2023" Report: “Shanghai’s garbage classification has achieved significant results"→ 0.5 (topic covered, but no specific percentage) • Insight: “Germany adopts a dual track recycling system" Report: “Developed coun...

[38] [40]

If >50% of core meaning is covered → 1.0

[39] [41]

If reasonable semantic connection exists → 1.0

[40] [42]

If only weak connection or keyword overlap → 0.5

[41] [43]

If no connection at all → 0.0
Principle: Prefer false positives over false negatives (The goal of recall assessment is to check if information is missing)
RESPONSE FORMAT
Respond ONLY with valid JSON (no markdown, no extra text):
{"results": [{"id": 1, "core_points": ["point1", "point2"], "found_in_report": "[quote or describe what was found]", "missing_poin...
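The per-insight grades in this rubric (1.0, 0.5, 0.0) still have to be combined into a single recall figure. The following is a minimal Python sketch of one way to do that, assuming the unweighted mean as the aggregation; the function name `information_recall` and the treatment of an empty insight list are illustrative assumptions, since this excerpt does not state the aggregation rule.

```python
from typing import Iterable

# The three grades named in the rubric: no, partial, and full coverage.
ALLOWED_GRADES = (0.0, 0.5, 1.0)


def information_recall(grades: Iterable[float]) -> float:
    """Average per-insight grades into one recall value in [0, 1].

    The unweighted mean is an assumption made for illustration only.
    """
    grades = list(grades)
    for grade in grades:
        if grade not in ALLOWED_GRADES:
            raise ValueError(f"unexpected grade: {grade!r}")
    if not grades:
        # Assumption: with no reference insights, nothing counts as recalled.
        return 0.0
    return sum(grades) / len(grades)


# Four insights: two fully covered, one partially covered, one missing.
print(information_recall([1.0, 0.5, 0.0, 1.0]))
```

Rejecting any grade outside the three rubric values keeps a malformed judge response from silently skewing the average.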

[42] [44]

A statement is a generalization, summary, inference, or extension of the content of the source document

[43] [45]

The statements use different wording, but have similar semantics

[44] [46]

The statement contains implicit information from the source document

[45] [47]

For images/videos: The content described may be visually visible or inferable

[46] [48]

The statement is a reasonable interpretation of the content of the source document, even if it is not the only interpretation

[47] [49]

The source document contains partially supporting content for this statement
Situations where it is determined as supported: false (limited to the following situations)
Only when one of the following conditions is met, it is determined as false:

[48] [50]

Statements that are directly contradictory to the source document (such as significant errors in numbers or completely opposite facts)

[49] [51]

The source document completely lacks any relevant content stated

[50] [52]

The statement cannot be reasonably inferred from the source document
Judgment principles
• Allow substantial generalization and inference
• Allow wording differences and different ways of expression
• Allow partially correct statements (as long as they are not completely wrong)
• For situations that are ambiguous or uncertain, they should all be determined as tr...
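The factual-accuracy prompt marks a statement as unsupported only when one of three listed conditions holds (direct contradiction, no relevant content in the source, or no reasonable inference). A minimal Python sketch of that decision rule follows; the class `ClaimCheck`, its field names, and the use of the supported fraction as the overall score are hypothetical choices for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClaimCheck:
    """One judged statement; field names are illustrative, not from the paper."""
    contradicts_source: bool = False    # directly contradictory to the source
    source_lacks_content: bool = False  # source has no relevant content at all
    not_inferable: bool = False         # cannot be reasonably inferred


def is_supported(check: ClaimCheck) -> bool:
    # A statement is unsupported only if at least one listed condition is met.
    return not (
        check.contradicts_source
        or check.source_lacks_content
        or check.not_inferable
    )


def factual_accuracy(checks: Sequence[ClaimCheck]) -> float:
    """Fraction of supported statements (an assumed aggregation)."""
    if not checks:
        return 0.0  # assumption: no checked statements yields a zero score
    return sum(is_supported(c) for c in checks) / len(checks)


checks = [
    ClaimCheck(),                         # paraphrase of the source: supported
    ClaimCheck(contradicts_source=True),  # wrong number: unsupported
    ClaimCheck(),                         # reasonable inference: supported
    ClaimCheck(not_inferable=True),       # no basis in the source: unsupported
]
print(factual_accuracy(checks))
```

Because every flag defaults to `False`, a statement with no recorded violation counts as supported, which mirrors the prompt's rule that false is reserved for the listed conditions.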

[51] [53]

A research question that the report attempts to answer
<research_question> Question </research_question>
<Report> result_text </Report>
Instructions:
ANALYZE THOROUGHLY: Examine the report in detail and identify any issues, even small ones. Look for subtle problems, minor inconsistencies, areas that could be improved, or any shortcomings that might affect...

[52] [54]

Use the FULL scoring range: Distribute scores across 1-10 based on actual quality differences. Do NOT cluster scores in a narrow range

[53] [55]

Differentiate clearly: A mediocre report should score 4-5, a good report 6-7, an excellent report 8-9. Only truly exceptional work deserves 10

[54] [56]

Be discriminating: Look for specific quality differences between reports. Better analysis, clearer structure, and deeper insights should result in higher scores

[55] [57]

Penalize appropriately: Minor issues = small deductions (0.5-1 point), major issues = significant deductions (2-3 points)

[56] [58]

Reward excellence: If a report demonstrates exceptional depth, clarity, or insight, give it the high score it deserves

[57] [59]

Compare mentally: Consider how this report compares to the best and worst possible reports on this topic.
Evaluation Criterion: Depth & Quality of Analysis
Evaluate how thoroughly the report analyzes the research question. BE HARSH: Look for superficiality, missing details, lack of evidence, weak reasoning.
• 1-2: Completely superficial, no real analysis, j...
