LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

arxiv: 2606.10460 · v1 · pith:NA2DAEULnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-27 13:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords question answering · data lakes · benchmarks · large language models · multi-hop reasoning · information retrieval · agent evaluation

The pith

LakeQA is a benchmark that tests whether language models can search a 9.5 TB data lake and compose multi-hop answers from heterogeneous sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs LakeQA to address the gap between reading-based QA, where evidence is given, and real-world questions whose evidence must first be located inside large data lakes. The benchmark draws on Wikipedia and open government sources totaling roughly 9.5 TB of structured and unstructured text. Each question is written so that an agent must locate the right documents and then chain implicit intermediate facts across them. When seven frontier models are evaluated, exact-match scores stay low: GPT-5.2, for instance, reaches only 18.37 percent, which indicates that current systems still lack reliable search-plus-reasoning behavior at this scale.

Core claim

LakeQA is a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. It is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data. Each sample is annotated by at least one Ph.D.-level expert, and each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging, with GPT-5.2 achieving only an exact-match score of 18.37 percent.

What carries the argument

LakeQA benchmark, a collection of expert-annotated questions over 9.5 TB of mixed structured and unstructured sources that forces an agent to locate documents before performing multi-hop composition.

If this is right

Current LLMs cannot reliably answer questions whose supporting facts are scattered across a data lake and must be located first.
Progress on LakeQA would require agents that interleave document discovery with evidence composition.
The benchmark supplies a concrete testbed for measuring whether new retrieval-augmented or agentic systems close the observed performance gap.
Scores on LakeQA are expected to remain low until models improve at handling implicit multi-hop paths rather than surface-level retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on LakeQA would likely transfer to private enterprise data lakes that share the same scale and heterogeneity.
The construction method could be reused to create similar benchmarks for domains such as scientific literature or legal corpora.
If models improve on LakeQA while retrieval quality is held fixed, the gain would isolate advances in the reasoning step itself.

Load-bearing premise

Expert annotation by Ph.D.-level reviewers guarantees that every question truly demands long-horizon search and multi-hop reasoning across heterogeneous sources rather than being solvable from a single document or trivial lookup.

What would settle it

A controlled experiment in which frontier models, given only the LakeQA questions and no retrieval tools, achieve exact-match scores above 70 percent, or an audit showing that most questions can be answered from one document without implicit intermediate steps.

Figures

Figures reproduced from arXiv: 2606.10460 by Austin Senna Wijaya, Daniela Pinto, Eden Wu, Eugene Wu, Grace Fan, Haonan Wang, Jiaxiang Liu, Juliana Freire, Reya Vir, Tianle Zhou, Wanting You, Yijia Chen, Yurong Liu, Yusen Zhang.

**Figure 2.** An illustrative task in LAKEQA demonstrating multi-hop exploratory question answering over a heterogeneous data lake. Given a complex natural language user question, an agent must iteratively decompose it into a sequence of sub-tasks that each depends on the answers from previous sub-tasks.

… contain millions of documents spanning both structured and unstructured formats (Halevy et al., 2016; Chapman et al.,…

**Figure 3.** Overview of LAKEQA creation process. (1) Data Lake Construction: We construct a heterogeneous collection of ∼9.5 TB from Wikipedia and data.gov, including both structured and unstructured data. (2) Task Creation: Annotators create multi-hop QA tasks through an iterative annotation process: we first derive facts from a document and reformulate them into subquestions (A), then chain subquestions by finding n…

**Figure 4.** Distribution of task domains.

Models. We evaluate our benchmark on seven frontier LLMs spanning both proprietary and open-source model families. For proprietary models, we include OpenAI gpt-5.2 and gpt-5-mini, and Anthropic Claude Haiku 4.5, Claude Sonnet 4.5, and Claude Opus 4.5. For open-source models, we include DeepSeek-R1 and Llama-3.3-70B-Instruct. We only evaluate Claude Opus 4.5 on LAKEQA-mini due…

**Figure 5.** Average EM on LAKEQA stratified by number of hops per task; each bar reports the mean EM over tasks in the corresponding range.

RQ1: How do LLMs perform on EQA tasks. Overall, all models achieve relatively low EM, indicating that EQA over a large, heterogeneous data lake remains challenging even for frontier LLMs. On LAKEQA-full, the best end-to-end performance is achieved by Anthropic Claude-sonnet-4.5 (…
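The Figure 5 caption describes mean EM reported per hop-count range. A minimal Python sketch of that kind of stratification follows; the function name `mean_em_by_hop_bucket`, the bucket boundaries, and the toy records are all hypothetical, since the page does not state the ranges the paper uses.

```python
from collections import defaultdict


def mean_em_by_hop_bucket(records, buckets):
    """Average exact-match (EM) per hop-count range.

    records: iterable of (num_hops, em) pairs, with em equal to 0 or 1.
    buckets: list of inclusive (low, high) hop ranges (hypothetical here).
    Records that fall outside every bucket are ignored.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [sum of EM, task count]
    for hops, em in records:
        for low, high in buckets:
            if low <= hops <= high:
                totals[(low, high)][0] += em
                totals[(low, high)][1] += 1
                break
    return {bucket: em_sum / count for bucket, (em_sum, count) in totals.items()}


# Toy data only; these are not results from the paper.
toy_records = [(2, 1), (3, 0), (4, 0), (5, 1), (6, 0), (7, 0)]
toy_buckets = [(1, 3), (4, 6), (7, 9)]
per_bucket = mean_em_by_hop_bucket(toy_records, toy_buckets)
```

Each bar in a plot of this kind would correspond to one entry of `per_bucket`.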

Original abstract

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LakeQA introduces a large-scale data-lake QA benchmark with expert annotation and reports low LLM scores, but the multi-hop claim lacks supporting details on task construction.

LakeQA is a new benchmark built on roughly 9.5 TB of mixed Wikipedia and government data that tries to test both document search and cross-source reasoning in one setup. The paper notes that most existing QA work supplies the evidence or makes retrieval trivial, while real questions do not, and it shows seven frontier models scoring low, with GPT-5.2 at 18.37% exact match.

What stands out is the scale and the heterogeneous sources. That combination is not common in current reading-comprehension benchmarks, and the low model numbers give a concrete signal that current systems still struggle when search is required first.

The soft spot is the central claim that every task demands long-horizon multi-hop reasoning with implicit intermediate steps. The only mechanism described is annotation by at least one Ph.D.-level expert. There is no inter-annotator agreement, no per-question hop counts, no pilot comparison against single-hop or surface-matching baselines, and no sampling or exclusion rules. Without those, it is not possible to tell how many items actually require the advertised behavior versus how many could be solved more directly.

The work is aimed at groups building LLM agents for data-lake search and analysis. Readers who need a realistic testbed for retrieval-plus-reasoning pipelines would get value once the construction details are expanded.

It deserves peer review. The idea fills a visible gap, but referees would need to verify the task properties and evaluation choices before the benchmark can be used with confidence.

Referee Report

3 major / 2 minor

Summary. The paper introduces LakeQA, a benchmark for search-centric question answering over heterogeneous data lakes (~9.5 TB from Wikipedia and open government sources). It claims that each of its tasks requires long-horizon multi-hop reasoning with implicit intermediate steps across sources, ensured by annotation from at least one Ph.D.-level expert, and reports that seven frontier LLMs perform poorly (e.g., GPT-5.2 achieves 18.37% exact-match).

Significance. If the tasks are verifiably multi-hop and search-dependent, LakeQA would address a genuine gap between reading-based QA benchmarks and realistic data-lake settings, providing a useful testbed for agentic systems that must both retrieve and compose evidence.

major comments (3)

Abstract and §3 (Benchmark Construction): the central claim that 'each task requires long-horizon multi-hop reasoning with implicit intermediate steps' and that 'agents need to discover the correct documents and then compose evidence across sources' rests solely on the statement that samples were annotated by Ph.D.-level experts. No inter-annotator agreement, no per-task reasoning-depth statistics, no pilot comparisons against single-hop or direct-lookup baselines, and no failure analysis of simpler retrieval methods are reported; without these, the 'search-centric' and 'multi-hop' characterizations cannot be assessed.
§3 and §4 (Data Construction and Evaluation): the manuscript provides no description of sampling strategy, exclusion rules, how heterogeneity across structured/unstructured sources is operationalized, or the precise evaluation metrics (exact match is mentioned but not defined or justified relative to partial credit or retrieval-aware measures). These omissions make it impossible to reproduce or interpret the reported LLM scores.
§5 (Experiments): the claim that LakeQA is 'challenging' is supported only by aggregate scores on seven LLMs; no ablation isolating search difficulty from reasoning difficulty, no error analysis by hop count, and no comparison to non-LLM baselines are supplied, weakening the assertion that the benchmark jointly tests both capabilities.
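The first major comment asks for inter-annotator agreement. One common way to report it for two annotators is Cohen's kappa; the sketch below is illustrative only, and the label scheme ("multi" versus "single" hop judgments) is an assumption rather than anything the paper describes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on four questions (not data from the paper).
annotator_1 = ["multi", "multi", "single", "multi"]
annotator_2 = ["multi", "single", "single", "multi"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value near 1 would indicate that annotators agree on which questions are multi-hop well beyond chance; a value near 0 would indicate agreement no better than chance.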

minor comments (2)

[Abstract] The scale claim ('million-scale data lake') should be accompanied by explicit counts of documents, tables, and entities rather than only the 9.5 TB figure.
[§4] Notation for the exact-match metric and any retrieval metrics should be defined in a dedicated evaluation subsection.
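Since the report notes that exact match is not defined, the following sketch shows one widely used convention: SQuAD-style answer normalization followed by string equality, with token-level F1 as a partial-credit alternative. This is an assumption for illustration; the page does not say which normalization LakeQA applies, and the example strings are invented.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1 if the normalized strings are identical, otherwise 0."""
    return int(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented answers for illustration only.
em_full = exact_match("The Pierce County.", "pierce county")
em_partial = exact_match("Pierce", "Pierce County")
f1_partial = token_f1("Pierce", "Pierce County")
```

The contrast between `em_partial` and `f1_partial` is the point the referee raises: exact match gives no credit to a partially correct answer, while token-level F1 does.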

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will make substantial revisions to improve documentation, add supporting analyses, and enhance reproducibility.

Referee: Abstract and §3 (Benchmark Construction): the central claim that 'each task requires long-horizon multi-hop reasoning with implicit intermediate steps' and that 'agents need to discover the correct documents and then compose evidence across sources' rests solely on the statement that samples were annotated by Ph.D.-level experts. No inter-annotator agreement, no per-task reasoning-depth statistics, no pilot comparisons against single-hop or direct-lookup baselines, and no failure analysis of simpler retrieval methods are reported; without these, the 'search-centric' and 'multi-hop' characterizations cannot be assessed.

Authors: We agree that additional quantitative support would strengthen the multi-hop and search-centric claims beyond expert annotation. The revised manuscript will report inter-annotator agreement on a multi-annotated subset, per-task reasoning-depth statistics extracted from expert notes, a pilot comparison of single-hop variants versus the full tasks, and failure rates of simpler retrieval baselines to better validate the characterizations. Revision: yes.

Referee: §3 and §4 (Data Construction and Evaluation): the manuscript provides no description of sampling strategy, exclusion rules, how heterogeneity across structured/unstructured sources is operationalized, or the precise evaluation metrics (exact match is mentioned but not defined or justified relative to partial credit or retrieval-aware measures). These omissions make it impossible to reproduce or interpret the reported LLM scores.

Authors: We acknowledge these omissions hinder reproducibility. The revised §3 and §4 will detail the sampling strategy from the 9.5 TB lake, explicit exclusion rules, how heterogeneity is operationalized (e.g., source-type distributions per question), and a precise definition of exact match with justification relative to alternatives such as token-level F1 or retrieval-aware metrics. Revision: yes.

Referee: §5 (Experiments): the claim that LakeQA is 'challenging' is supported only by aggregate scores on seven LLMs; no ablation isolating search difficulty from reasoning difficulty, no error analysis by hop count, and no comparison to non-LLM baselines are supplied, weakening the assertion that the benchmark jointly tests both capabilities.

Authors: We accept that aggregate LLM scores alone do not fully isolate search versus reasoning difficulty. The revised §5 will incorporate an ablation separating retrieval and composition stages, error analysis broken down by estimated hop count, and comparisons to non-LLM baselines (e.g., BM25 retrieval plus heuristic reasoning) to more rigorously demonstrate the joint challenges. Revision: yes.
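The rebuttal mentions BM25 retrieval as a possible non-LLM baseline. The sketch below covers only the retrieval half, using a plain-Python Okapi BM25 scorer with whitespace tokenization and the non-negative idf variant `ln(1 + (N - df + 0.5) / (df + 0.5))`. The function name `bm25_rank`, the parameter defaults `k1=1.5` and `b=0.75`, and the toy documents are assumptions for illustration, not the authors' setup.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices ordered by descending BM25 score."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scored = []
    for idx, toks in enumerate(tokenized):
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Length-normalized term-frequency saturation.
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scored.append((score, idx))
    # Ties are broken by original document order.
    return [idx for _, idx in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy corpus; real data-lake documents would need chunking and an index.
toy_docs = [
    "school districts in washington state",
    "federal highway traffic counts",
    "washington county population table",
]
ranking = bm25_rank("school districts", toy_docs)
```

A lexical baseline like this gives a floor for the search stage: if it already surfaces the gold documents, low end-to-end scores would point to the composition step rather than retrieval.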

Circularity Check

0 steps flagged

No derivations, equations, or fitted quantities present

LakeQA is a benchmark introduction paper with no mathematical derivations, equations, parameter fittings, or prediction steps. Claims about task properties rest on expert annotation rather than any self-referential construction or self-citation chain that reduces to inputs. No load-bearing steps match the enumerated circularity patterns; the work is self-contained as a dataset description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No mathematical derivations or fitted parameters appear in the abstract. The work rests on domain assumptions about expert annotation quality and task design.

axioms (1)

domain assumption: Ph.D.-level expert annotation ensures task quality and realism
Stated in abstract as the method used to ensure quality; no further justification provided.

pith-pipeline@v0.9.1-grok · 5801 in / 1181 out tokens · 22535 ms · 2026-06-27T13:39:55.672356+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Chen, W., Chang, M.-W., Schlinger, E., Wang, W., and Cohen, W. W. Open question answering over tables and text. arXiv preprint arXiv:2010.10439, 2020a. Chen, W., Zha, H., Chen, Z., Xiong, W., Wang, H., and Wang, W. Y. Hybridqa: A dataset of multi-hop question answering over tabular and ...

arXiv 2010
[2]

Lakebench: A benchmark for discovering joinable and unionable tables in data lakes

Deng, Y., Chai, C., Li, L., Luo, S., Qin, Y., Lian, J., and Li, G. Lakebench: A benchmark for discovering joinable and unionable tables in data lakes. In Proceedings of the VLDB Endowment, volume 17, pp. 1925–1938,

1925
[3]

Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161,

Pith/arXiv arXiv 1903
[4]

Goods: Organizing google’s datasets

Halevy, A., Korn, F., Noy, N. F., Olston, C., Polyzotis, N., Roy, S., and Whang, S. E. Goods: Organizing google’s datasets. In Proceedings of the 2016 International Conference on Management of Data, pp. 795–806,

2016
[5]

Characterizing deep research: A benchmark and formal definition

Java, A., Khandelwal, A., Midigeshi, S., Halfaker, A., Deshpande, A., Goyal, N., Gupta, A., Natarajan, N., and Sharma, A. Characterizing deep research: A benchmark and formal definition. arXiv preprint arXiv:2508.04183,

arXiv
[6]

Dense passage retrieval for open-domain question answering

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781,

2020
[7]

Looking beyond the surface: A challenge set for reading comprehension over multiple sentences

Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262,

2018
[8]

Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents

Li, S., Bu, X., Wang, W., Liu, J., Dong, J., He, H., Lu, H., Zhang, H., Jing, C., Li, Z., et al. Mm-browsecomp: A comprehensive benchmark for multimodal browsing agents. arXiv preprint arXiv:2508.13186,

arXiv
[9]

Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

doi: 10.1109/TPAMI.2018.2889473. Microsoft. Overview of copilot in Fabric. https://learn.microsoft.com/en-us/fabric/fundamentals/copilot-fabric-overview,

work page doi:10.1109/tpami.2018.2889473 2018
[10]

Data lake management: challenges and opportunities

Nargesian, F., Zhu, E., Miller, R. J., Pu, K. Q., and Arocena, P. C. Data lake management: challenges and opportunities. Proceedings of the VLDB Endowment, 12(12): 1986–1989,

1986
[11]

Accessed: 2026-05-27. OpenAI. Inside OpenAI’s in-house data agent. https://openai.com/index/inside-our-in-house-data-agent/,

2026
[12]

Lancedb - embracing composability in the storage layer

Pace, W., She, C., Xu, L., Jones, W., Meng, R., and Cen, Y. Lancedb - embracing composability in the storage layer. In VLDB 2025 Workshop: Third International Workshop on Composable Data Management Systems,

2025
[13]

Totto: A controlled table-to-text generation dataset

URL https://www.vldb.org/2025/Workshops/VLDB-Workshops-2025/CDMS/CDMS25_15.pdf. Parikh, A. P., Wang, X., Gehrmann, S., Faruqui, M., Dhingra, B., Yang, D., and Das, D. Totto: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1173–1186,

2025
[14]

Know what you don’t know: Unanswerable questions for squad

Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822,

Pith/arXiv arXiv
[15]

The Web as a Knowledge-base for Answering Complex Questions

doi: 10.1561/1500000019. Talmor, A. and Berant, J. The web as a knowledge-base for answering complex questions. arXiv preprint arXiv:1803.06643,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1561/1500000019
[16]

Multimodalqa: Complex question answering over text, tables and images

Talmor, A., Yoran, O., Catav, A., Lahav, D., Wang, Y., Asai, A., Ilharco, G., Hajishirzi, H., and Berant, J. Multimodalqa: Complex question answering over text, tables and images. arXiv preprint arXiv:2104.06039,

arXiv
[17]

Browsecomp: A simple yet challenging benchmark for browsing agents

Wei, J., Sun, Z., Papay, S., McKinney, S., Han, J., Fulford, I., Chung, H. W., Passos, A. T., Fedus, W., and Glaese, A. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516,

Pith/arXiv arXiv
[18]

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2369–2380,

2018
[19]

Autoddg: Automated dataset description generation using large language models

Zhang, H., Liu, Y., Santos, A., Hung, W.-L., and Freire, J. Autoddg: Automated dataset description generation using large language models. arXiv preprint arXiv:2502.01050, 2025a. Zhang, S. and Balog, K. Web table extraction, retrieval, and augmentation: A survey. ACM Transactions on Intelligent Systems and Technology, 11(2):1–35,

arXiv
[20]

Qwen3 embedding: Advancing text embedding and reranking through foundation models

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., Huang, F., and Zhou, J. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025b. Zhu, E., Deng, D., Nargesian, F., and Miller, R. J. JOSIE: Overlap set similarity search for finding joinable tabl...

Pith/arXiv arXiv 2019
[21]

[Text, Table, Image, Explicit Multi-hop Reasoning] MultiModalQA extends existing reading comprehension datasets such as Natural Questions (NQ) (Kwiatkowski et al., 2019), BoolQ (Clark et al., 2019), and HotpotQA (Yang et al.,

2019
[22]

[Table, Free-form Response, Wikipedia] FETAQA extends existing table QA benchmarks whose answers are typically short answers evaluated by exact matching by introducing long, informative free-form answers grounded in a single Wikipedia table. To construct such questions, FETAQA starts from ToTTo (Parikh et al., 2020), a large-scale table-to-text dataset co...

2020
[23]

result": ...} or {

Table 9.Distribution of benchmark tasks across data.gov theme categories. Theme Category Tasks Government & Admin 428 Environment 272 Transportation 54 Health & Social 324 Research & Demographics 35 Economy & Infrastructure 149 Public Safety 116 Education 179 F . Agent Interface Tool Implementations.All data access tools operate over a fixed S3 data lake ...

2020
[24]

question

who is US president after 2000 , and in a hop, we take an intersection between the two subquestions: who is a Democratics US President after 2000 , as long as the answer to the intersection is correct, the task is fine – because eventually each task is consist of the final question and its answer, all subquestion and facts are just for the sake of creatin...

2000
[25]

, "revision_subquestion

The intersection of all nodes in node 4 are still 12 districts. Yet, they somehow all transform into the 6 counties that contain those 12 districts any sources.", "revision_subquestion": "Add nodes 4 and 5 to convert the districts into counties. Here are the districts: Camas School DistrictCarbonado School DistrictColfax School DistrictDieringer School Di...

2020
