Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

Michele Orini; Peijie Yu; Wei Liu; Yali Du; Yulan He

arXiv: 2602.02039 · v2 · pith:ECTZRGFWnew · submitted 2026-02-02 · cs.AI · cs.CL · cs.DB · cs.LG

Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

Pith reviewed 2026-05-21 13:58 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.DB · cs.LG

keywords investigatory intelligence · deep data research · agentic LLMs · DDR-Bench · autonomous exploration · data science benchmarks · LLM agency

The pith

Investigatory intelligence in LLMs requires intrinsic exploration strategies beyond scaling or scaffolding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes investigatory intelligence, where agentic LLMs set their own goals and decide what to explore from raw data, from executional intelligence that follows given tasks. It introduces Deep Data Research as an open-ended task and DDR-Bench as a checklist-based benchmark to test autonomous insight extraction from databases. Evaluations of frontier models show emerging agency alongside clear shortfalls in sustaining long-horizon exploration. The central result is that success hinges on the models' own internal strategies rather than external agent setups or increased scale alone.

Core claim

Defining investigatory intelligence as autonomy to set goals and explore in data contexts, the authors build DDR and DDR-Bench to evaluate LLMs starting from raw databases without explicit queries. Results indicate frontier models display some agency yet face persistent challenges with extended exploration, showing that intrinsic strategies within agentic models determine effective investigatory performance.

What carries the argument

DDR-Bench, the checklist-based benchmark that turns open-ended database exploration into verifiable scores for autonomous insight extraction.
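
As a rough illustration of how a checklist can turn open-ended output into a score, the sketch below computes accuracy as the share of checklist items labelled CORRECT_INFO. The three label names and the "proportion of CORRECT_INFO" rule come from the paper's evaluation prompt; the function name, the input format, and the example verdicts are assumptions made here for illustration, not the authors' code.

```python
from collections import Counter

# Label names follow the paper's evaluation prompt; everything else is illustrative.
LABELS = ("CORRECT_INFO", "INCORRECT_INFO", "INSUFFICIENT_INFO")


def checklist_accuracy(verdicts):
    """Return the fraction of checklist items judged CORRECT_INFO."""
    if not verdicts:
        raise ValueError("no checklist verdicts to score")
    unknown = set(verdicts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(verdicts)
    return counts["CORRECT_INFO"] / len(verdicts)


# Hypothetical verdicts for a four-item checklist (made-up data).
example = ["CORRECT_INFO", "INSUFFICIENT_INFO", "CORRECT_INFO", "INCORRECT_INFO"]
print(checklist_accuracy(example))  # 0.5
```

Note that under this rule an item the agent never investigated (INSUFFICIENT_INFO) costs as much as one it got wrong, which is part of why exhaustive exploration could pay off on such a benchmark.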

Load-bearing premise

The checklist-based DDR-Bench measures genuine investigatory intelligence instead of model behaviors tuned to this particular evaluation format.

What would settle it

A demonstration that high DDR-Bench scores can be reached by models that follow checklists mechanically yet fail to extract useful insights when given new databases outside the benchmark.

Figures

Figures reproduced from arXiv: 2602.02039 by Michele Orini, Peijie Yu, Wei Liu, Yali Du, Yulan He.

**Figure 1.** Inference-time scaling performance in DDR-Bench across different dimensions. The y-axis reports checklist accuracy. Beyond final accuracy, DDR-Bench provides rich test-time exploration information from different scaling dimensions, enabling detailed analysis of model agency behaviour. See details in §4.

**Figure 2.** Left: Compared with previous tasks, DDR maximises exploration openness and agency, focusing on the direct evaluation of insight quality. Right: Overview of the DDR-Bench. Details of the trajectory samples are shown in Appendix H.

**Figure 3.** A case of Claude Sonnet 4.5’s trajectory and evaluation checklist in the MIMIC scenario of DDR-Bench. Verified fact and supporting insights are underlined. See details of this trajectory in Figure A16. The patient id is anonymised.

**Figure 4.** Ranking correlation between novelty and accuracy on Proprietary and Open-Source LLMs. Circles denote the novelty rank, and diamonds denote the accuracy rank. Models are ordered by accuracy rank in the figure. All three scenarios present high correlation.
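
To make the novelty-versus-accuracy rank comparison concrete, here is a minimal sketch of a rank correlation between two per-model scores. This page does not say which rank statistic the authors use; Spearman's rho is assumed here, and the scores below are made-up numbers, not results from the paper.

```python
def average_ranks(values):
    """Rank values in descending order (1 = best), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores (made-up data).
accuracy = [0.40, 0.35, 0.20, 0.10]
novelty = [0.60, 0.65, 0.30, 0.25]
rho = spearman(accuracy, novelty)
print(round(rho, 3))  # 0.8
```

In this toy example only the top two models swap places between the two rankings, so the correlation stays high but below 1.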

**Figure 5.** Exploration patterns of different models. The x-axis denotes exploration entropy, reflecting the depth of the model’s search over the database, while the y-axis represents database coverage, indicating the breadth of the search.

[Plot residue: panels MIMIC, GLOBEM, 10-K; x-axis: Interaction Progress (%); y-axis: Avg Log Probability.]
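
The two axes in Figure 5 can be illustrated with a small sketch. The exact definitions are not given on this page, so the code below assumes one plausible reading: entropy is the Shannon entropy of how the agent's queries are spread across tables, normalised by the log of the total table count, and coverage is the share of tables touched at least once. The function name and table names are hypothetical.

```python
import math
from collections import Counter


def exploration_stats(accessed_tables, all_tables):
    """Return (normalised_entropy, coverage) for one trajectory.

    Assumed definitions, not taken from the paper:
      normalised_entropy = H(access distribution) / log(number of tables)
      coverage           = distinct tables accessed / number of tables
    """
    all_tables = set(all_tables)
    if len(all_tables) < 2:
        raise ValueError("need at least two tables to normalise entropy")
    # Ignore accesses to names that are not tables in this database.
    counts = Counter(t for t in accessed_tables if t in all_tables)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(all_tables)), len(counts) / len(all_tables)


# Hypothetical trajectory over a four-table database (made-up names).
tables = ["patients", "labs", "admissions", "prescriptions"]
trajectory = ["patients", "patients", "labs", "labs"]
entropy, coverage = exploration_stats(trajectory, tables)
print(round(entropy, 3), coverage)  # 0.5 0.5
```

Under these assumed definitions, an agent that hammers a single table scores zero entropy, while one that spreads queries evenly over every table scores 1 on both axes.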

**Figure 6.** Self-termination visualisation on the Qwen family.

**Figure 7.** Training-time factors study within the Qwen family. From left to right, the three columns examine inference-time scaling performance across all scenarios for models with different parameter scales, different context optimisation methods, and different model generations with different training strategies.

**Figure 8.** Distribution of manually annotated error types across models and task scenarios.

Original abstract

The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces DDR-Bench for open-ended data exploration but the checklist approach may reward standard analysis routines more than true autonomous goal-setting.

The key takeaway is that this work defines investigatory intelligence as the ability of LLMs to set their own goals and explore raw databases for insights, separate from just executing given tasks. They back this with DDR-Bench, a large-scale checklist-based evaluation.

What the paper does well is spot a limitation in current benchmarks. Most existing ones give models explicit queries or tasks, but real data science often starts with raw data and requires deciding what to look for. The DDR task tries to capture that open-ended nature, and the results suggest frontier models have some emerging capability here but struggle with sustained exploration over long horizons. This points to the need for better intrinsic strategies beyond just bigger models or added scaffolding. The checklist for scoring is a practical choice because it makes open-ended outputs verifiable without relying on subjective judgment. That helps make the benchmark usable at scale.

The soft spot is that this same checklist could let models rack up points by methodically applying a fixed set of data analysis steps, like computing summaries, correlations, and plots, without any actual novel goal-setting or insight. If performance tracks more with how well the model has internalized standard data science routines from training data, then the attribution to intrinsic strategies weakens. The paper would be stronger with more information on how the checklists were created, whether they were validated by domain experts, and what steps were taken to prevent gaming through exhaustive but shallow exploration. Details on the specific models, databases, and any controls for statistical significance are also thin in the writeup.

This paper is mainly for researchers developing agentic LLMs and those designing new evaluation frameworks. Anyone thinking about how to test for more independent AI behavior in analytical domains could find the task setup helpful.

The core idea holds up, but the implementation needs tighter validation to support the conclusions firmly. I would recommend sending it to peer review. The benchmark idea is worth developing, and referees could help refine the evaluation to better isolate the intended capabilities.

Referee Report

2 major / 2 minor

Summary. The paper introduces the concept of investigatory intelligence for agentic LLMs, distinguishing it from executional intelligence. It defines Deep Data Research (DDR) as an open-ended task requiring autonomous insight extraction from raw databases and presents DDR-Bench, a large-scale checklist-based benchmark for verifiable evaluation of such tasks. Empirical results on frontier models indicate emerging agency but persistent difficulties with long-horizon exploration, leading to the claim that effective investigatory intelligence depends on intrinsic model strategies beyond agent scaffolding or scale alone.

Significance. If the benchmark and results hold, the work fills a gap in evaluating autonomous, open-ended data exploration capabilities in LLMs and provides evidence that intrinsic strategies contribute to agentic performance. This could inform future benchmark design and model development for long-horizon agency in AI systems.

major comments (2)

[DDR-Bench construction and evaluation protocol] The central claim that performance differences reflect intrinsic strategies rather than scaffolding or scale alone rests on DDR-Bench providing a reliable measure of investigatory intelligence. However, the checklist approach (detailed in the benchmark construction) risks scoring models highly for exhaustive adherence to standard data-science steps (summary statistics, correlations, visualizations) without requiring genuine goal-setting or novel insight discovery, which could undermine attribution to intrinsic strategies.
[Results and experimental setup] The abstract and results sections report outcomes on frontier models but provide insufficient detail on model selection criteria, database construction process, checklist validation against human experts, or statistical controls for prompt engineering effects. This makes it difficult to assess whether observed differences support the distinction between investigatory and executional intelligence.

minor comments (2)

[Benchmark description] Clarify the exact number of databases and tasks in DDR-Bench, including any statistics on task diversity or difficulty distribution.
[Results figures] Ensure all figures include error bars or confidence intervals for model performance comparisons.
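
One way the requested confidence intervals could be produced is a percentile bootstrap over per-item checklist outcomes. The sketch below is an assumption about how such an interval might be computed, not a description of the authors' analysis; the function name, defaults, and the 0/1 outcomes are hypothetical.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 checklist outcomes.

    Returns (mean, lower, upper). A fixed seed keeps the result reproducible.
    """
    if not outcomes:
        raise ValueError("no outcomes to resample")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Hypothetical outcomes: 30 checklist items passed, 20 failed (made-up data).
outcomes = [1] * 30 + [0] * 20
mean, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Resampling at the item level treats checklist items as independent; if items cluster by database or scenario, resampling whole databases would be the more conservative choice.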

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify opportunities to strengthen the clarity and rigor of our presentation. We address each major comment point by point below, indicating the specific revisions we will make to the next version of the manuscript.

Referee: [DDR-Bench construction and evaluation protocol] The central claim that performance differences reflect intrinsic strategies rather than scaffolding or scale alone rests on DDR-Bench providing a reliable measure of investigatory intelligence. However, the checklist approach (detailed in the benchmark construction) risks scoring models highly for exhaustive adherence to standard data-science steps (summary statistics, correlations, visualizations) without requiring genuine goal-setting or novel insight discovery, which could undermine attribution to intrinsic strategies.

Authors: We agree that a purely procedural checklist could conflate routine execution with genuine investigatory behavior. In the revised manuscript we have expanded the Benchmark Construction section to explicitly describe how each checklist item was derived from expert-identified key insights that require autonomous goal-setting and non-standard analytical paths (e.g., discovering latent interactions not captured by default summary statistics). We have also added concrete examples contrasting checklist items that reward rote steps versus those that reward novel insight discovery. Finally, we have inserted a short limitations subsection acknowledging that checklist verification remains an imperfect proxy and outlining planned future work on open-ended human judgment protocols. These changes preserve the original claim while making the link to intrinsic strategies more transparent.

Revision: yes

Referee: [Results and experimental setup] The abstract and results sections report outcomes on frontier models but provide insufficient detail on model selection criteria, database construction process, checklist validation against human experts, or statistical controls for prompt engineering effects. This makes it difficult to assess whether observed differences support the distinction between investigatory and executional intelligence.

Authors: We accept that the original experimental description was insufficiently detailed for full reproducibility and for isolating the investigatory-versus-executional distinction. In the revised version we have substantially expanded the Experimental Setup section with: (i) explicit model-selection criteria (frontier models chosen for documented agentic tool-use capabilities rather than scale alone); (ii) the full database-generation pipeline, including how ground-truth insights were embedded and verified; (iii) the multi-expert checklist validation protocol together with inter-rater agreement statistics; and (iv) additional statistical controls (multiple prompt templates, permutation tests, and regression analysis of prompt-engineering variance). These additions directly support the claim that performance gaps reflect intrinsic model strategies beyond scaffolding or scale.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of self-defined quantities

The paper introduces DDR and DDR-Bench as an open-ended task and checklist-based evaluation for investigatory intelligence in LLMs, with claims resting on observed performance differences across models. No equations, fitted parameters, or derivations are described that reduce predictions to inputs by construction. The analysis of intrinsic strategies versus scaffolding or scale is presented as an empirical finding from benchmark runs rather than a self-referential definition or self-citation chain. The setup is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the work is presented as an empirical benchmark introduction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We term this investigatory intelligence, distinguishing it from executional intelligence... DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Paper passage: "exploration entropy... Normalised Exploration Entropy... balanced exploration regime"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL · 2026-04 · unverdicted · novelty 7.0

DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
