PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

Matteo Falconi; Pierluigi Plebani; Riccardo Terrenzi; Serkan Ayvaz

arxiv: 2605.18199 · v1 · pith:DQGLFV5Tnew · submitted 2026-05-18 · 💻 cs.IR · cs.AI

PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

Riccardo Terrenzi , Matteo Falconi , Serkan Ayvaz , Pierluigi Plebani This is my paper

Pith reviewed 2026-05-20 00:23 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords tabular dataset searchcontent-based retrievalLLM pseudoqueriestable profilingdense retrievaldata lakesTableQAdataset ranking

0 comments

The pith

PIPER retrieves tabular datasets by profiling tables and embedding LLM-generated pseudoqueries for dense search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PIPER to handle search over tabular datasets in data lakes and similar collections where metadata is often incomplete or low-quality. It profiles each table to capture schema and cell values, then prompts an LLM to generate pseudoqueries that represent the table's meaning. These pseudoqueries are embedded into vectors so that user queries can be matched via dense retrieval and ranking. A sympathetic reader would care because this content-based approach promises better dataset discovery and reuse than metadata-only systems, especially as the volume of tabular data grows. The work demonstrates gains over both classical baselines and methods adapted from table question answering.

Core claim

PIPER is a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval, outperforming both classical metadata-based baselines and strong TableQA retrieval methods in poor-metadata settings.

What carries the argument

Table profiles combined with LLM-generated pseudoqueries embedded for dense retrieval; profiles summarize table content to guide the LLM in producing queries whose vectors enable semantic ranking of relevant datasets.

Load-bearing premise

LLM-generated pseudoqueries from table profiles produce embeddings that reliably capture table meaning for ranking purposes across diverse domains and table sizes.

What would settle it

A held-out test collection of tables and queries from an unseen domain where relevance judgments show metadata-only retrieval achieving higher precision at top-10 or top-20 than PIPER.

Figures

Figures reproduced from arXiv: 2605.18199 by Matteo Falconi, Pierluigi Plebani, Riccardo Terrenzi, Serkan Ayvaz.

**Figure 1.** Figure 1: Architecture of the offline phase of the proposed method. **Glucose**: Data is of type integer. There are 136 unique values. This column is numeric. Mean: 120.89453125, Max: 199, Min: 0. Coverage spans from 0 to 196.0. Listing 1.1. Snippet of statistical profile of a single column. where tij denotes the detected datatype, uij the number of distinct values, µ miss ij the missing-value information, γij the v… view at source ↗

**Figure 2.** Figure 2: Architecture of the online phase of the proposed method. Comprehensive diabetes datasets with 5+ years follow-up HbA1c levels treatment outcomes Listing 1.3. Examples of subquery obtained after query optimization. Query optimization User queries may be incomplete, ambiguous, or phrased with terminology that does not align directly with the indexed pseudoqueries. To improve retrieval, we apply a two-step LL… view at source ↗

**Figure 3.** Figure 3: 95% bootstrap confidence intervals for nDCG@10 on the NTCIR-15 tabular subset. Full stands for full PIPER system. QOpt stands for no query optimization ablation. When query optimization helps. The ablation results suggest that query optimization is not uniformly beneficial. On TARGET, where queries are already tightly aligned with the target tables, its effect is small and can even be slightly negative, l… view at source ↗

read the original abstract

The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering, where the goal is to select a single table to answer a question, rather than retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval. Designed for dataset search in poor-metadata settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PIPER uses table profiles to drive LLM pseudoqueries for dense dataset ranking, which is a reasonable practical step but rests on thin evidence so far.

read the letter

The paper's main move is to profile tables, have an LLM turn those profiles into pseudoqueries, embed the queries, and rank datasets by similarity. This targets search over data lakes and portals where metadata is spotty and the goal is to surface relevant tables rather than answer a single question against one table. That distinction from standard TableQA work is the clearest new angle here. The abstract frames it as outperforming both metadata baselines and existing TableQA retrieval approaches, which is a straightforward claim worth checking against real results. The approach itself is simple enough that it could be useful for practitioners who already have embedding pipelines and want to add content signals without full schema documentation. The stress-test concern about profiling large or mixed tables is worth watching. If the profile step samples or summarizes too aggressively, the generated queries can drop value distributions or cross-column relations, and the dense retrieval signal weakens. The abstract does not spell out how they handle scale or heterogeneity, so that part of the argument is not yet secured. No obvious circularity or invented entities show up in the positioning. The work is aimed at people building dataset search tools for analytics and AI reuse pipelines. A reader who needs concrete ideas for content-based ranking in poor-metadata settings would find the pipeline description and any ablations helpful. It is coherent on its own terms and shows honest engagement with the gap it claims to address. I would send it for peer review so the experiments and scaling behavior can be examined properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes PIPER, a content-driven retrieval method for tabular datasets that profiles tables and uses LLMs to generate pseudoqueries which are then embedded for dense retrieval. Designed for dataset search in poor-metadata settings, the paper claims that PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods.

Significance. If the empirical results hold across the claimed range of domains and table sizes, the work would demonstrate a practical advance in content-based table search by showing that LLM-generated pseudoqueries can provide stronger signals than metadata or existing TableQA approaches for dataset ranking and reuse.

major comments (2)

[§4] §4 (Experiments): No scaling analysis is presented for tables that exceed typical LLM context windows or contain heterogeneous numeric/categorical mixes; without this, the claim that profiling plus pseudoquery embedding reliably captures table meaning for ranking across diverse domains and sizes remains untested and load-bearing for the central contribution.
[§3.2] §3.2 (Profiling and Pseudoquery Generation): The description of how row/column samples or summaries are constructed does not quantify information loss (e.g., omitted value distributions or inter-column relationships), which directly affects whether the resulting embeddings can be expected to outperform metadata baselines.

minor comments (2)

[Abstract] Abstract: The outperformance claim would be clearer if the abstract named the primary datasets, the embedding model, and the main ranking metric used.
[§3] Notation: The distinction between 'table profile' and 'pseudoquery' is introduced without a compact formal definition or diagram; a small figure or equation would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental validation and methodological description that we will address to strengthen the paper.

read point-by-point responses

Referee: [§4] §4 (Experiments): No scaling analysis is presented for tables that exceed typical LLM context windows or contain heterogeneous numeric/categorical mixes; without this, the claim that profiling plus pseudoquery embedding reliably captures table meaning for ranking across diverse domains and sizes remains untested and load-bearing for the central contribution.

Authors: We acknowledge that our experiments focus on tables fitting within standard LLM context windows and do not include dedicated scaling tests for very large tables or highly heterogeneous numeric/categorical mixes. The profiling step samples a fixed number of rows and columns (detailed in §3.2) precisely to avoid context limits and enable applicability to larger tables. To directly address this point, we will add scaling experiments in the revised §4 using both real-world large tables and controlled synthetic datasets that vary in size and heterogeneity, along with a discussion of how sampling preserves ranking signals. This will test and support the central claim across a broader range. revision: yes
Referee: [§3.2] §3.2 (Profiling and Pseudoquery Generation): The description of how row/column samples or summaries are constructed does not quantify information loss (e.g., omitted value distributions or inter-column relationships), which directly affects whether the resulting embeddings can be expected to outperform metadata baselines.

Authors: We agree that a more explicit treatment of information loss would improve the justification for the profiling approach. Section §3.2 currently describes the sampling of representative rows (via diversity-based selection) and LLM-based column summaries but does not quantify retained statistics such as value distributions or inter-column correlations. In the revision we will expand this section with details on the sampling criteria and include a quantitative analysis (e.g., comparing pre- and post-sampling distribution metrics on the evaluation datasets) to demonstrate that sufficient information is preserved for effective pseudoquery generation and superior retrieval performance over metadata baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PIPER derivation chain

full rationale

The paper introduces PIPER as a content-driven retrieval method that profiles tables and uses LLM-generated pseudoqueries for dense embedding-based ranking. This builds directly on established LLM embedding and retrieval techniques without any load-bearing step that reduces by definition, fitted parameter, or self-citation chain to the paper's own inputs. The central performance claim rests on empirical comparison against metadata baselines and TableQA methods rather than on a self-referential construction or renamed known result. No equations or sections exhibit self-definitional loops, uniqueness imported from prior author work, or ansatz smuggling; the approach remains externally falsifiable through standard retrieval metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.0 · 5690 in / 1100 out tokens · 34549 ms · 2026-05-20T00:23:10.360925+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The statistical profile of Di is defined as Pi = Profile(Di) = {ϕ(c_i1), …, ϕ(c_im_i)} … for numerical columns … min, max, mean, and median.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1]

In: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Halevy, A., Franklin, M., Maier, D.: Principles of dataspace systems. In: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. p. 1–9. ACM, Chicago IL USA (Jun 2006).https://doi.org/10.1145/1142351.1142352,https://dl.acm.org/doi/ 10.1145/1142351.1142352

work page doi:10.1145/1142351.1142352 2006
[2]

In: Proceedings of the 44th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval

Kato, M.P., Ohshima, H., Liu, Y.-H., Chen, H.-L.: A test collection for ad-hoc dataset retrieval. In: Proceedings of the 44th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval. p. 2450–2456. ACM, Virtual Event Canada (Jul 2021).https://doi.org/10.1145/3404835.3463261, https://dl.acm.org/doi/10.1145/3404835.3463261

work page doi:10.1145/3404835.3463261 2021
[3]

Zhang, H., Liu, Y., Hung, W.-L., Santos, A., Freire, J.: Autoddg: Automated dataset description generation using large language models (arXiv:2502.01050) (Feb 2025).https://doi.org/10.48550/arXiv.2502.01050,http://arxiv.org/ abs/2502.01050, arXiv:2502.01050 [cs]

work page doi:10.48550/arxiv.2502.01050 2025
[4]

IEEE Access13, 39510–39522 (2025).https://doi.org/10.1109/ACCESS.2025.3545387

Al-Qatf, M., Haque, R., Alsamhi, S.H., Buosi, S., Razzaq, M.A., Timilsina, M., Hawbani, A., Curry, E.: RAG4DS: Retrieval-Augmented Generation for Data Spaces—A Unified Lifecycle, Challenges, and Opportunities. IEEE Access13, 39510–39522 (2025).https://doi.org/10.1109/ACCESS.2025.3545387

work page doi:10.1109/access.2025.3545387 2025
[5]

In: 2020 IEEE 36th international Conference on Data Engineering (ICDE)

Bogatu, A., Fernandes, A.A., Paton, N.W., Konstantinou, N.: Dataset discovery in data lakes. In: 2020 IEEE 36th international Conference on Data Engineering (ICDE). pp. 709–720. IEEE (2020)

work page 2020
[6]

In: The World Wide Web Conference

Brickley, D., Burgess, M., Noy, N.: Google dataset search: Building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference. p. 1365–1375. WWW ’19, Association for Computing Machinery, New York, NY, USA(2019).https://doi.org/10.1145/3308558.3313685,https://doi.org/10. 1145/3308558.3313685

work page doi:10.1145/3308558.3313685 2019
[7]

Cafarella, M.J., Halevy, A., Khoussainova, N.: Data integration for the relational web2(1) (2009).https://doi.org/10.14778/1687627.1687750

work page doi:10.14778/1687627.1687750 2009
[8]

The VLDB Journal29(1), 251–272 (Jan 2020).https://doi.org/10.1007/s00778-019-00564-x 14 R

Chapman,A.,Simperl,E.,Koesten,L.,Konstantinidis,G.,Ibáñez,L.D.,Kacprzak, E., Groth, P.: Dataset search: a survey. The VLDB Journal29(1), 251–272 (Jan 2020).https://doi.org/10.1007/s00778-019-00564-x 14 R. Terrenzi et al

work page doi:10.1007/s00778-019-00564-x 2020
[9]

arXiv preprint arXiv:2010.10439 , year=

Chen, W., Chang, M.W., Schlinger, E., Wang, W., Cohen, W.W.: Open question answering over tables and text. arXiv preprint arXiv:2010.10439 (2020)

work page arXiv 2010
[10]

In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Chen, Z., Trabelsi, M., Heflin, J., Xu, Y., Davison, B.D.: Table search using a deep contextualized language model. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 589–598 (2020)

work page 2020
[11]

TechRxiv (April 2025).https://doi.org/10.36227/ techrxiv.174352282.22844759/v1

Cheng, M., Mao, Q., Liu, Q., Zhou, Y., Li, Y., Wang, J., Lin, J., Cao, J., Chen, E.: A survey on table mining with large language models: Challenges, ad- vancements and prospects. TechRxiv (April 2025).https://doi.org/10.36227/ techrxiv.174352282.22844759/v1

work page arXiv 2025
[12]

arXiv preprint arXiv:2201.09745 (2022)

Dong, H., Cheng, Z., He, X., Zhou, M., Zhou, A., Zhou, F., Liu, A., Han, S., Zhang, D.: Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. arXiv preprint arXiv:2201.09745 (2022)

work page arXiv 2022
[13]

In: Proceedings of the 47th International ACM SI- GIR Conference on Research and Development in Information Retrieval

Dong, H., Wang, Z.: Large language models for tabular data: Progresses and future directions. In: Proceedings of the 47th International ACM SI- GIR Conference on Research and Development in Information Retrieval. p. 2997–3000. SIGIR ’24, Association for Computing Machinery, New York, NY, USA (2024).https://doi.org/10.1145/3626772.3661384,https://dl.acm.or...

work page doi:10.1145/3626772.3661384 2024
[14]

In: International Conference on Ad- vanced Information Systems Engineering

Falconi, M., Plebani, P.: Improving content-based data product retrieval in fed- erated environments with llm and sampling. In: International Conference on Ad- vanced Information Systems Engineering. pp. 289–297. Springer (2025)

work page 2025
[15]

In: 2024 IEEE International Conference on Big Data (BigData)

Fujita, Y., Hayashi, T., Kuwahara, M.: Inferring relationships between tabular data and topics using llm for a dataset search task. In: 2024 IEEE International Conference on Big Data (BigData). pp. 6564–6573. IEEE (2024)

work page 2024
[16]

IEEE Transactions on Knowledge and Data Engineering35(12), 12571– 12590 (2023)

Hai, R., Koutras, C., Quix, C., Jarke, M.: Data lakes: A survey of functions and systems. IEEE Transactions on Knowledge and Data Engineering35(12), 12571– 12590 (2023)

work page 2023
[17]

Emg-transnn-mha: A transformer-based model for enhanced motor intent recognition in assistive robotics,

Hayashi, T., Sakaji, H., Dai, J., Goebel, R.: Metadata-based data exploration with retrieval-augmented generation for large language models. In: 2024 IEEE Inter- national Conference on Big Data (BigData). p. 6574–6583 (Dec 2024).https: //doi.org/10.1109/BigData62323.2024.10826055,https://ieeexplore.ieee. org/abstract/document/10826055

work page doi:10.1109/bigdata62323.2024.10826055 2024
[18]

arXiv preprint arXiv:2505.11545 (2025)

Ji, X., Glenn, P., Parameswaran, A.G., Hulsebos, M.: Target: Benchmarking table retrieval for generative tasks. arXiv preprint arXiv:2505.11545 (2025)

work page arXiv 2025
[19]

Proceedings of the ACM on Management of Data2(3), 1–28 (2024)

Li, P., He, Y., Yashar, D., Cui, W., Ge, S., Zhang, H., Rifinski Fainman, D., Zhang, D., Chaudhuri, S.: Table-gpt: Table fine-tuned gpt for diverse table tasks. Proceedings of the ACM on Management of Data2(3), 1–28 (2024)

work page 2024
[20]

In: Chang, S., Hulsebos, M., Liu, Q., Chen, W., Sun, H

Liang, H.P., Chang, C.W., Fan, Y.C.: Improving table retrieval with question generation from partial tables. In: Chang, S., Hulsebos, M., Liu, Q., Chen, W., Sun, H. (eds.) Proceedings of the 4th Table Representation Learning Workshop. pp. 217–228. Association for Computational Linguistics, Vienna, Austria (Jul 2025).https://doi.org/10.18653/v1/2025.trl-1....

work page doi:10.18653/v1/2025.trl-1.19 2025
[21]

In: WebDB

Liu, J., Dong, X., Halevy, A.Y.: Answering structured queries on unstructured data. In: WebDB. vol. 6, pp. 25–30 (2006)

work page 2006
[22]

Transactions of the Association for Computational Linguistics10, 35– 49 (2022) PIPER 15

Nan, L., Hsieh, C., Mao, Z., Lin, X.V., Verma, N., Zhang, R., Kryściński, W., Schoelkopf, H., Kong, R., Tang, X., Mutuma, M., Rosand, B., Trindade, I., Ban- daru, R., Cunningham, J., Xiong, C., Radev, D.: Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics10, 35– 49 (2022) PIPER 15

work page 2022
[23]

In: Proceedings of the 37th International Conference on Scalable Scientific Data Management

Nandi, A., Chao, W.L., Qin, R., Boettiger, C., Lapp, H., Berger-Wolf, T.: Om- nimesh: Addressing findability challenges in distributed nature data repositories. In: Proceedings of the 37th International Conference on Scalable Scientific Data Management. pp. 1–6 (2025)

work page 2025
[24]

ACM Computing Surveys56(4), 1–37 (Apr 2024).https://doi.org/10.1145/3626521

Paton, N.W., Chen, J., Wu, Z.: Dataset discovery and exploration: A survey. ACM Computing Surveys56(4), 1–37 (Apr 2024).https://doi.org/10.1145/3626521

work page doi:10.1145/3626521 2024
[25]

Answering Table Queries on the Web using Column Keywords

Pimplikar, R., Sarawagi, S.: Answering table queries on the web using column keywords. arXiv preprint arXiv:1207.0132 (2012)

work page internal anchor Pith review Pith/arXiv arXiv 2012
[26]

Knowledge-based systems294, 111740 (2024)

Silva, L., Barbosa, L.: Improving dense retrieval models with LLM augmented data for dataset search. Knowledge-based systems294, 111740 (2024)

work page 2024
[27]

48550/arXiv.2503.09003,http://arxiv.org/abs/2503.09003

Singh, M., Kumar, A., Donaparthi, S., Karambelkar, G.: Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs (arXiv:2503.09003) (Mar 2025).https://doi.org/10. 48550/arXiv.2503.09003,http://arxiv.org/abs/2503.09003

work page arXiv 2025
[28]

arXiv preprint ArXiv:2305.13062 (2023)

Sui, Y., Zhou, M., Zhou, M., Han, S., Zhang, D.: Gpt4table: Can large language models understand structured table data? a benchmark and empirical study. arXiv preprint ArXiv:2305.13062 (2023)

work page arXiv 2023
[29]

In: Proceedings of the ACM Web Con- ference 2022

Trabelsi, M., Chen, Z., Zhang, S., Davison, B.D., Heflin, J.: Strubert: Structure- aware bert for table search and matching. In: Proceedings of the ACM Web Con- ference 2022. pp. 442–451 (2022)

work page 2022
[30]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need (arXiv:1706.03762) (Aug 2023).https://doi.org/10.48550/arXiv.1706.03762,http://arxiv.org/abs/ 1706.03762

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023
[31]

In: Proceedings of the 44th In- ternational ACM SIGIR Conference on Research and Development in Information Retrieval

Wang, F., Sun, K., Chen, M., Pujara, J., Szekely, P.: Retrieving complex tables with multi-granular graph representation learning. In: Proceedings of the 44th In- ternational ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1472–1482 (2021)

work page 2021
[32]

Proceedings of the ACM on Management of Data1(4), 1–27 (2023)

Wang, Q., Castro Fernandez, R.: Solo: Data discovery using natural language ques- tions via a self-supervised approach. Proceedings of the ACM on Management of Data1(4), 1–27 (2023)

work page 2023
[33]

ACM SIGMOD Record45(2), 33–44 (Sep 2016).https://doi.org/10.1145/3003665.3003672

Wang, Y., Song, S., Chen, L.: A survey on accessing dataspaces. ACM SIGMOD Record45(2), 33–44 (Sep 2016).https://doi.org/10.1145/3003665.3003672

work page doi:10.1145/3003665.3003672 2016
[34]

Scientific data3(1), 1–9 (2016)

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., Bouw- man, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C.T., Finkers, R., González-Beltrán, A., Gray, A.J.G., Groth, P., Goble, C., Grethe, J.S., Heringa, J., ’t Hoe...

work page 2016
[35]

arXiv preprint arXiv:2005.08314 (2020)

Yin, P., Neubig, G., Yih, W.t., Riedel, S.: TaBERT: Pretraining for joint under- standing of textual and tabular data. arXiv preprint arXiv:2005.08314 (2020)

work page arXiv 2005
[36]

In: Proceed- ings of the 2018 world wide web conference

Zhang, S., Balog, K.: Ad hoc table retrieval using semantic similarity. In: Proceed- ings of the 2018 world wide web conference. pp. 1553–1562 (2018)

work page 2018
[37]

Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

Zhou, W., Ma, B., Friedrich, A., Mesgar, M.: Table question answering in the era of large language models: A comprehensive survey of tasks, methods, and evaluation. arXiv preprint arXiv:2510.09671 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

In: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Halevy, A., Franklin, M., Maier, D.: Principles of dataspace systems. In: Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. p. 1–9. ACM, Chicago IL USA (Jun 2006).https://doi.org/10.1145/1142351.1142352,https://dl.acm.org/doi/ 10.1145/1142351.1142352

work page doi:10.1145/1142351.1142352 2006

[2] [2]

In: Proceedings of the 44th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval

Kato, M.P., Ohshima, H., Liu, Y.-H., Chen, H.-L.: A test collection for ad-hoc dataset retrieval. In: Proceedings of the 44th International ACM SIGIR Confer- ence on Research and Development in Information Retrieval. p. 2450–2456. ACM, Virtual Event Canada (Jul 2021).https://doi.org/10.1145/3404835.3463261, https://dl.acm.org/doi/10.1145/3404835.3463261

work page doi:10.1145/3404835.3463261 2021

[3] [3]

Zhang, H., Liu, Y., Hung, W.-L., Santos, A., Freire, J.: Autoddg: Automated dataset description generation using large language models (arXiv:2502.01050) (Feb 2025).https://doi.org/10.48550/arXiv.2502.01050,http://arxiv.org/ abs/2502.01050, arXiv:2502.01050 [cs]

work page doi:10.48550/arxiv.2502.01050 2025

[4] [4]

IEEE Access13, 39510–39522 (2025).https://doi.org/10.1109/ACCESS.2025.3545387

Al-Qatf, M., Haque, R., Alsamhi, S.H., Buosi, S., Razzaq, M.A., Timilsina, M., Hawbani, A., Curry, E.: RAG4DS: Retrieval-Augmented Generation for Data Spaces—A Unified Lifecycle, Challenges, and Opportunities. IEEE Access13, 39510–39522 (2025).https://doi.org/10.1109/ACCESS.2025.3545387

work page doi:10.1109/access.2025.3545387 2025

[5] [5]

In: 2020 IEEE 36th international Conference on Data Engineering (ICDE)

Bogatu, A., Fernandes, A.A., Paton, N.W., Konstantinou, N.: Dataset discovery in data lakes. In: 2020 IEEE 36th international Conference on Data Engineering (ICDE). pp. 709–720. IEEE (2020)

work page 2020

[6] [6]

In: The World Wide Web Conference

Brickley, D., Burgess, M., Noy, N.: Google dataset search: Building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference. p. 1365–1375. WWW ’19, Association for Computing Machinery, New York, NY, USA(2019).https://doi.org/10.1145/3308558.3313685,https://doi.org/10. 1145/3308558.3313685

work page doi:10.1145/3308558.3313685 2019

[7] [7]

Cafarella, M.J., Halevy, A., Khoussainova, N.: Data integration for the relational web2(1) (2009).https://doi.org/10.14778/1687627.1687750

work page doi:10.14778/1687627.1687750 2009

[8] [8]

The VLDB Journal29(1), 251–272 (Jan 2020).https://doi.org/10.1007/s00778-019-00564-x 14 R

Chapman,A.,Simperl,E.,Koesten,L.,Konstantinidis,G.,Ibáñez,L.D.,Kacprzak, E., Groth, P.: Dataset search: a survey. The VLDB Journal29(1), 251–272 (Jan 2020).https://doi.org/10.1007/s00778-019-00564-x 14 R. Terrenzi et al

work page doi:10.1007/s00778-019-00564-x 2020

[9] [9]

arXiv preprint arXiv:2010.10439 , year=

Chen, W., Chang, M.W., Schlinger, E., Wang, W., Cohen, W.W.: Open question answering over tables and text. arXiv preprint arXiv:2010.10439 (2020)

work page arXiv 2010

[10] [10]

In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Chen, Z., Trabelsi, M., Heflin, J., Xu, Y., Davison, B.D.: Table search using a deep contextualized language model. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 589–598 (2020)

work page 2020

[11] [11]

TechRxiv (April 2025).https://doi.org/10.36227/ techrxiv.174352282.22844759/v1

Cheng, M., Mao, Q., Liu, Q., Zhou, Y., Li, Y., Wang, J., Lin, J., Cao, J., Chen, E.: A survey on table mining with large language models: Challenges, ad- vancements and prospects. TechRxiv (April 2025).https://doi.org/10.36227/ techrxiv.174352282.22844759/v1

work page arXiv 2025

[12] [12]

arXiv preprint arXiv:2201.09745 (2022)

Dong, H., Cheng, Z., He, X., Zhou, M., Zhou, A., Zhou, F., Liu, A., Han, S., Zhang, D.: Table pre-training: A survey on model architectures, pre-training objectives, and downstream tasks. arXiv preprint arXiv:2201.09745 (2022)

work page arXiv 2022

[13] [13]

In: Proceedings of the 47th International ACM SI- GIR Conference on Research and Development in Information Retrieval

Dong, H., Wang, Z.: Large language models for tabular data: Progresses and future directions. In: Proceedings of the 47th International ACM SI- GIR Conference on Research and Development in Information Retrieval. p. 2997–3000. SIGIR ’24, Association for Computing Machinery, New York, NY, USA (2024).https://doi.org/10.1145/3626772.3661384,https://dl.acm.or...

work page doi:10.1145/3626772.3661384 2024

[14] [14]

In: International Conference on Ad- vanced Information Systems Engineering

Falconi, M., Plebani, P.: Improving content-based data product retrieval in fed- erated environments with llm and sampling. In: International Conference on Ad- vanced Information Systems Engineering. pp. 289–297. Springer (2025)

work page 2025

[15] [15]

In: 2024 IEEE International Conference on Big Data (BigData)

Fujita, Y., Hayashi, T., Kuwahara, M.: Inferring relationships between tabular data and topics using llm for a dataset search task. In: 2024 IEEE International Conference on Big Data (BigData). pp. 6564–6573. IEEE (2024)

work page 2024

[16] [16]

IEEE Transactions on Knowledge and Data Engineering35(12), 12571– 12590 (2023)

Hai, R., Koutras, C., Quix, C., Jarke, M.: Data lakes: A survey of functions and systems. IEEE Transactions on Knowledge and Data Engineering35(12), 12571– 12590 (2023)

work page 2023

[17] [17]

Emg-transnn-mha: A transformer-based model for enhanced motor intent recognition in assistive robotics,

Hayashi, T., Sakaji, H., Dai, J., Goebel, R.: Metadata-based data exploration with retrieval-augmented generation for large language models. In: 2024 IEEE Inter- national Conference on Big Data (BigData). p. 6574–6583 (Dec 2024).https: //doi.org/10.1109/BigData62323.2024.10826055,https://ieeexplore.ieee. org/abstract/document/10826055

work page doi:10.1109/bigdata62323.2024.10826055 2024

[18] [18]

arXiv preprint arXiv:2505.11545 (2025)

Ji, X., Glenn, P., Parameswaran, A.G., Hulsebos, M.: Target: Benchmarking table retrieval for generative tasks. arXiv preprint arXiv:2505.11545 (2025)

work page arXiv 2025

[19] [19]

Proceedings of the ACM on Management of Data2(3), 1–28 (2024)

Li, P., He, Y., Yashar, D., Cui, W., Ge, S., Zhang, H., Rifinski Fainman, D., Zhang, D., Chaudhuri, S.: Table-gpt: Table fine-tuned gpt for diverse table tasks. Proceedings of the ACM on Management of Data2(3), 1–28 (2024)

work page 2024

[20] [20]

In: Chang, S., Hulsebos, M., Liu, Q., Chen, W., Sun, H

Liang, H.P., Chang, C.W., Fan, Y.C.: Improving table retrieval with question generation from partial tables. In: Chang, S., Hulsebos, M., Liu, Q., Chen, W., Sun, H. (eds.) Proceedings of the 4th Table Representation Learning Workshop. pp. 217–228. Association for Computational Linguistics, Vienna, Austria (Jul 2025).https://doi.org/10.18653/v1/2025.trl-1....

work page doi:10.18653/v1/2025.trl-1.19 2025

[21] [21]

In: WebDB

Liu, J., Dong, X., Halevy, A.Y.: Answering structured queries on unstructured data. In: WebDB. vol. 6, pp. 25–30 (2006)

work page 2006

[22] [22]

Transactions of the Association for Computational Linguistics10, 35– 49 (2022) PIPER 15

Nan, L., Hsieh, C., Mao, Z., Lin, X.V., Verma, N., Zhang, R., Kryściński, W., Schoelkopf, H., Kong, R., Tang, X., Mutuma, M., Rosand, B., Trindade, I., Ban- daru, R., Cunningham, J., Xiong, C., Radev, D.: Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics10, 35– 49 (2022) PIPER 15

work page 2022

[23] [23]

In: Proceedings of the 37th International Conference on Scalable Scientific Data Management

Nandi, A., Chao, W.L., Qin, R., Boettiger, C., Lapp, H., Berger-Wolf, T.: Om- nimesh: Addressing findability challenges in distributed nature data repositories. In: Proceedings of the 37th International Conference on Scalable Scientific Data Management. pp. 1–6 (2025)

work page 2025

[24] [24]

ACM Computing Surveys56(4), 1–37 (Apr 2024).https://doi.org/10.1145/3626521

Paton, N.W., Chen, J., Wu, Z.: Dataset discovery and exploration: A survey. ACM Computing Surveys56(4), 1–37 (Apr 2024).https://doi.org/10.1145/3626521

work page doi:10.1145/3626521 2024

[25] [25]

Answering Table Queries on the Web using Column Keywords

Pimplikar, R., Sarawagi, S.: Answering table queries on the web using column keywords. arXiv preprint arXiv:1207.0132 (2012)

work page internal anchor Pith review Pith/arXiv arXiv 2012

[26] [26]

Knowledge-based systems294, 111740 (2024)

Silva, L., Barbosa, L.: Improving dense retrieval models with LLM augmented data for dataset search. Knowledge-based systems294, 111740 (2024)

work page 2024

[27] [27]

48550/arXiv.2503.09003,http://arxiv.org/abs/2503.09003

Singh, M., Kumar, A., Donaparthi, S., Karambelkar, G.: Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs (arXiv:2503.09003) (Mar 2025).https://doi.org/10. 48550/arXiv.2503.09003,http://arxiv.org/abs/2503.09003

work page arXiv 2025

[28] [28]

arXiv preprint ArXiv:2305.13062 (2023)

Sui, Y., Zhou, M., Zhou, M., Han, S., Zhang, D.: Gpt4table: Can large language models understand structured table data? a benchmark and empirical study. arXiv preprint ArXiv:2305.13062 (2023)

work page arXiv 2023

[29] [29]

In: Proceedings of the ACM Web Con- ference 2022

Trabelsi, M., Chen, Z., Zhang, S., Davison, B.D., Heflin, J.: Strubert: Structure- aware bert for table search and matching. In: Proceedings of the ACM Web Con- ference 2022. pp. 442–451 (2022)

work page 2022

[30] [30]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need (arXiv:1706.03762) (Aug 2023).https://doi.org/10.48550/arXiv.1706.03762,http://arxiv.org/abs/ 1706.03762

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023

[31] [31]

In: Proceedings of the 44th In- ternational ACM SIGIR Conference on Research and Development in Information Retrieval

Wang, F., Sun, K., Chen, M., Pujara, J., Szekely, P.: Retrieving complex tables with multi-granular graph representation learning. In: Proceedings of the 44th In- ternational ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1472–1482 (2021)

work page 2021

[32] [32]

Proceedings of the ACM on Management of Data1(4), 1–27 (2023)

Wang, Q., Castro Fernandez, R.: Solo: Data discovery using natural language ques- tions via a self-supervised approach. Proceedings of the ACM on Management of Data1(4), 1–27 (2023)

work page 2023

[33] [33]

ACM SIGMOD Record45(2), 33–44 (Sep 2016).https://doi.org/10.1145/3003665.3003672

Wang, Y., Song, S., Chen, L.: A survey on accessing dataspaces. ACM SIGMOD Record45(2), 33–44 (Sep 2016).https://doi.org/10.1145/3003665.3003672

work page doi:10.1145/3003665.3003672 2016

[34] [34]

Scientific data3(1), 1–9 (2016)

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., Bouw- man, J., Brookes, A.J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C.T., Finkers, R., González-Beltrán, A., Gray, A.J.G., Groth, P., Goble, C., Grethe, J.S., Heringa, J., ’t Hoe...

work page 2016

[35] [35]

arXiv preprint arXiv:2005.08314 (2020)

Yin, P., Neubig, G., Yih, W.t., Riedel, S.: TaBERT: Pretraining for joint under- standing of textual and tabular data. arXiv preprint arXiv:2005.08314 (2020)

work page arXiv 2005

[36] [36]

In: Proceed- ings of the 2018 world wide web conference

Zhang, S., Balog, K.: Ad hoc table retrieval using semantic similarity. In: Proceed- ings of the 2018 world wide web conference. pp. 1553–1562 (2018)

work page 2018

[37] [37]

Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

Zhou, W., Ma, B., Friedrich, A., Mesgar, M.: Table question answering in the era of large language models: A comprehensive survey of tasks, methods, and evaluation. arXiv preprint arXiv:2510.09671 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025