Generative Retrieval for Table Union Search

Chenhao Ma; Linting Wang; Shulun Zhang; Yingli Zhou; Yuwei Xu

arxiv: 2607.00833 · v1 · pith:KNUY5Y7Dnew · submitted 2026-07-01 · 💻 cs.DB

Generative Retrieval for Table Union Search

Shulun Zhang , Linting Wang , Yuwei Xu , Yingli Zhou , Chenhao Ma This is my paper

Pith reviewed 2026-07-02 02:58 UTC · model grok-4.3

classification 💻 cs.DB

keywords table union searchgenerative retrievaldata lakestable discoveryconstrained decodingsemantic identifiersunionability

0 comments

The pith

GenTUS reformulates table union search as direct generation of unionability-aware identifiers rather than candidate retrieval followed by reranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GenTUS to address the scaling problems of traditional table union search methods in data lakes. Those methods encode tables, retrieve a candidate pool, and then match or rerank, so quality depends on whether the pool contains the right tables and costs rise with lake size. GenTUS instead assigns each table a compact identifier that encodes unionability and trains a generator to output the identifiers of tables that can be unioned with a given query. Constrained decoding at query time guarantees only valid identifiers are produced. Experiments on seven public benchmarks show this yields the highest retrieval quality while lowering latency, storage, and update costs.

Core claim

GenTUS assigns candidate tables compact unionability-aware identifiers and trains a generator to produce the identifiers of unionable tables directly from the query. At query time, constrained decoding ensures that generated identifiers correspond to valid data-lake tables and returns them as ranked retrieval results. This replaces the encode-search-refine pipeline and removes dependence on candidate-pool recall.

What carries the argument

Constrained generation over discrete semantic table identifiers that encode unionability, allowing the model to output valid table identifiers directly instead of ranking from an explicit candidate pool.

If this is right

Retrieval quality no longer depends on the recall of an initial candidate pool.
Online latency drops because no search over growing candidate sets is performed.
Storage for retrieval artifacts shrinks since explicit indexes or embeddings are not required.
Incremental updates become cheaper because new tables do not force rebuilding of large retrieval structures.
Average rank of 1.05 across seven TUS benchmarks versus 2.57 for the strongest baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The identifier design could be applied to other data-discovery tasks where direct generation is feasible instead of retrieval.
If identifier vocabulary grows too large, generation quality may degrade unless training data covers the full range of union patterns.
The approach might extend to column-level or schema-matching retrieval by redefining the identifiers accordingly.

Load-bearing premise

The generative model trained on unionability-aware identifiers will reliably produce identifiers for every relevant unionable table that exists in the lake.

What would settle it

A benchmark run in which GenTUS misses at least one table known to be unionable that a traditional full-candidate method retrieves, resulting in lower recall than the strongest baseline.

Figures

Figures reproduced from arXiv: 2607.00833 by Chenhao Ma, Linting Wang, Shulun Zhang, Yingli Zhou, Yuwei Xu.

**Figure 1.** Figure 1: An example of table union search. 1 Introduction Modern data lakes contain large collections of heterogeneous tables published by different organizations and curated under different schemas, metadata conventions, and quality standards [1, 9, 18, 32, 49]. To support downstream analysis, users often need to discover tables from a data lake that are relevant to a given analytical task. In many cases, the info… view at source ↗

**Figure 2.** Figure 2: Prior TUS methods vs. GenTUS. Prior works. As summarized in [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of GenTUS. requirement through unionability-aware semantic identifiers and ranking-calibrated constrained generation. 4 Overview [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Unionability-aware semantic identifier construc [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Precision@𝐾 and Recall@𝐾 as 𝐾 increases on the seven datasets. GenTUS maintains stronger precision–recall tradeoffs across different 𝐾 values. TUS Small TUS Large SANTOS Small SANTOS Large Wiki Union LakeBench 1K LakeBench 30K 100 1k 10k 100k Offline time (s, log scale) GenTUS TACTUS Starmie LIFTus Sherlock SATO [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Offline build time (seconds, log scale), including the [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 8.** Figure 8: Retrieval-artifact storage (log scale, MB): semantic [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: Incremental indexing. (a) MAP over D1–D5 with [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 10.** Figure 10: Beam-width trade-off between MAP and mean [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

read the original abstract

Modern data lakes contain heterogeneous tables whose task-relevant information is often scattered across different schemas, sources, and naming conventions. Table union search (TUS) retrieves tables that can be reliably unioned with a query table, supporting data discovery, enrichment, and downstream analytics. Although learning-based TUS methods improve table- or column-level representations, they still follow an encode-search-refine pipeline: candidate retrieval is followed by query-candidate matching or reranking, making quality dependent on candidate-pool recall and incurring growing latency and storage costs as the data lake scales. We propose GenTUS, a generative retrieval framework that reformulates TUS as constrained generation over discrete semantic table identifiers. Instead of searching and reranking an explicit candidate pool, GenTUS assigns candidate tables compact unionability-aware identifiers and trains a generator to produce the identifiers of unionable tables directly from the query. At query time, constrained decoding ensures that generated identifiers correspond to valid data-lake tables and returns them as ranked retrieval results. Experiments on seven public TUS benchmarks show that GenTUS achieves the best overall retrieval quality, with an average rank of 1.05 compared to 2.57 for the strongest baseline, while substantially reducing online latency, retrieval-artifact storage, and incremental update cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GenTUS recasts table union search as constrained generation of unionability-aware identifiers and claims top benchmark ranks plus efficiency gains, but the recall story needs verification.

read the letter

The main things here are that the paper applies generative retrieval to table union search by training a model to emit discrete identifiers instead of running encode-search-refine, and the experiments report an average rank of 1.05 across seven benchmarks versus 2.57 for the strongest baseline, along with lower latency, storage, and update costs.

What is new is the reformulation itself. Prior TUS work builds representations then retrieves from an explicit pool; this approach assigns compact identifiers that supposedly capture unionability, trains a generator on them, and uses constrained decoding to output only valid tables. That sidesteps candidate-pool maintenance and should help with scaling and incremental updates.

The paper does a reasonable job on the motivation and the efficiency logic. Removing the growing candidate set is a clean way to address the latency and storage problems that come with larger data lakes.

The soft spot is the recall question raised in the stress test. If the generator fails to produce identifiers for some relevant tables, especially infrequent unionability cases, the method just moves the recall problem inside the model rather than solving it. The abstract gives no concrete information on how the identifiers are built, what the training objective does for coverage, or any mechanism beyond validity constraints. The quantitative claims are presented without experimental setup details, baseline descriptions, or significance tests, so the 1.05 rank number is hard to evaluate on its own.

This is for people working on data lake discovery and retrieval methods. A reader who follows generative approaches to search would find the framing worth looking at. The core idea is coherent and the claims are falsifiable, so the paper deserves a serious referee.

Referee Report

2 major / 0 minor

Summary. The paper proposes GenTUS, a generative retrieval framework for table union search (TUS) that reformulates the task as constrained generation of compact unionability-aware table identifiers. Instead of an encode-search-refine pipeline, a model is trained to directly emit identifiers of unionable tables from a query table; constrained decoding at inference ensures validity and produces ranked results. Experiments on seven public TUS benchmarks are reported to show GenTUS achieving the best overall retrieval quality (average rank 1.05 vs. 2.57 for the strongest baseline) while reducing online latency, storage, and incremental update costs.

Significance. If the generative model reliably achieves high recall of all ground-truth unionable tables, the approach could meaningfully improve scalability for TUS in large heterogeneous data lakes by removing dependence on explicit candidate pools. The reported efficiency gains in latency, artifact storage, and update cost would be practically valuable if the quality claims hold under rigorous experimental controls.

major comments (2)

[Abstract] Abstract: the central claim of superior retrieval quality (avg. rank 1.05) rests on the generator emitting identifiers for essentially all relevant unionable tables. No information is provided on how unionability-aware identifiers are constructed, the training objective, coverage of rare unionability cases, or any mechanism (beyond validity constraints) that would guarantee the model does not omit relevant tables; if coverage is incomplete, the method reintroduces the recall problem it claims to solve.
[Abstract] Abstract: quantitative results are presented without any description of experimental setup, baselines, statistical significance testing, or dataset characteristics. This prevents assessment of whether the reported rank improvement is robust or whether the generative approach was fairly compared to encode-search-refine methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract to incorporate additional details while preserving its brevity.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of superior retrieval quality (avg. rank 1.05) rests on the generator emitting identifiers for essentially all relevant unionable tables. No information is provided on how unionability-aware identifiers are constructed, the training objective, coverage of rare unionability cases, or any mechanism (beyond validity constraints) that would guarantee the model does not omit relevant tables; if coverage is incomplete, the method reintroduces the recall problem it claims to solve.

Authors: The abstract is intentionally concise. The full manuscript provides these details in Section 3 (identifier construction from semantic table embeddings capturing unionability signals) and Section 4 (training objective as constrained seq2seq generation). The seven-benchmark evaluation shows GenTUS attaining the highest recall@K across all datasets, indicating effective coverage of unionable tables including less frequent cases; constrained decoding enforces validity but the learned distribution, not just constraints, drives recall. We will add one sentence to the abstract summarizing identifier construction and the training objective. revision: yes
Referee: [Abstract] Abstract: quantitative results are presented without any description of experimental setup, baselines, statistical significance testing, or dataset characteristics. This prevents assessment of whether the reported rank improvement is robust or whether the generative approach was fairly compared to encode-search-refine methods.

Authors: We agree the abstract omits these elements. The full paper (Section 5) describes the seven public TUS benchmarks, the encode-search-refine baselines, and reports statistical significance via paired t-tests on the rank improvements. We will revise the abstract to include a brief clause on the experimental setup and dataset scale. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on benchmark evaluation

full rationale

The paper presents GenTUS as a modeling reformulation of TUS into constrained generative retrieval over table identifiers, with all performance claims (average rank 1.05) grounded in direct experimental comparison against baselines on seven public benchmarks. No equations, training objectives, or central premises reduce by construction to fitted inputs, self-definitions, or self-citation chains; the derivation chain is self-contained as an engineering proposal whose validity is tested externally rather than assumed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper introduces a new representation and generative process; details on parameters not available in abstract.

axioms (1)

domain assumption Semantic table identifiers can effectively encode unionability information for generative modeling.
Central to the framework's design as described in the abstract.

invented entities (1)

unionability-aware table identifiers no independent evidence
purpose: Compact representation for direct generation in retrieval
Introduced as part of the new framework.

pith-pipeline@v0.9.1-grok · 5759 in / 1064 out tokens · 53647 ms · 2026-07-02T02:58:11.122687+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 41 canonical work pages · 1 internal anchor

[1]

Ziawasch Abedjan, Mahdi Esmailoghli, and Sainyam Galhotra. 2025. Data Dis- covery in Data Lakes: Operations, Indexes, Systems.Proc. VLDB Endow.18, 12 (Aug. 2025), 5455–5459. doi:10.14778/3750601.3750694

work page doi:10.14778/3750601.3750694 2025
[2]

Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. 2023. A Survey on Table Representation Learning.ACM/IMS J. Data Sci.1, 1 (2023), 2:1–2:55. doi:10.1145/ 3589777

2023
[3]

Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih, Sebastian Riedel, and Fabio Petroni. 2022. Autoregressive search engines: generating sub- strings as document identifiers. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’22). Curran Associates Inc., Red Hook, NY, USA...

2022
[4]

Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, and Nikolaos Konstanti- nou. 2020. Dataset Discovery in Data Lakes. In36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE, 709–720. doi:10.1109/ICDE48307.2020.00067

work page doi:10.1109/icde48307.2020.00067 2020
[5]

Dan Brickley, Matthew Burgess, and Natasha F. Noy. 2019. Google Dataset Search: Building a Search Engine for Datasets in an Open Web Ecosystem. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17,

2019
[6]

doi:10.1145/3308558.3313685

ACM, 1365–1375. doi:10.1145/3308558.3313685

work page doi:10.1145/3308558.3313685
[7]

Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang

Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang
[8]

VLDB Endow

WebTables: exploring the power of tables on the web.Proc. VLDB Endow. 1, 1 (Aug. 2008), 538–549. doi:10.14778/1453856.1453916

work page doi:10.14778/1453856.1453916 2008
[9]

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Au- toregressive Entity Retrieval. In9th International Conference on Learning Rep- resentations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=5k8F6UU39V

2021
[10]

Sonia Castelo, Rémi Rampin, Aécio S. R. Santos, Aline Bessa, Fernando Chirigati, and Juliana Freire. 2021. Auctus: A Dataset Search Engine for Data Discovery and Augmentation.Proc. VLDB Endow.14, 12 (2021), 2791–2794. doi:10.14778/ 3476311.3476346

work page arXiv 2021
[11]

Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). 1001–1012. doi:10.1109/ICDE.2018.00094

work page doi:10.1109/icde.2018.00094 2018
[12]

Chengliang Chai, Yuhao Deng, Yutong Zhan, Ziqi Cao, Yuanfang Zhang, Lei Cao, Yu-Ping Wang, Zhiwei Zhang, Ye Yuan, Guoren Wang, and Nan Tang
[13]

VLDB Endow.17, 12 (2024), 4381–4384

LakeCompass: An End-to-End System for Table Maintenance, Search and Analysis in Data Lakes.Proc. VLDB Endow.17, 12 (2024), 4381–4384. doi:10. 14778/3685800.3685880

work page arXiv 2024
[14]

Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, and David Karger. 2020. ARDA: automatic relational data augmentation for machine learning.Proc. VLDB Endow.13, 9 (May 2020), 1373–1387. doi:10. 14778/3397230.3397235

work page arXiv 2020
[15]

Tianji Cong, Fatemeh Nargesian, and H. V. Jagadish. 2023. Pylon: Semantic Table Union Search in Data Lakes.CoRRabs/2301.04901 (2023)

work page arXiv 2023
[16]

Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2020. TURL: table understanding through representation learning.Proc. VLDB Endow.14, 3 (Nov. 2020), 307–319. doi:10.14778/3430915.3430921

work page doi:10.14778/3430915.3430921 2020
[17]

Yuhao Deng, Chengliang Chai, Lei Cao, Qin Yuan, Siyuan Chen, Yanrui Yu, Zhaoze Sun, Junyi Wang, Jiajun Li, Ziqi Cao, Kaisen Jin, Chi Zhang, Yuqing Jiang, Yuanfang Zhang, Yuping Wang, Ye Yuan, Guoren Wang, and Nan Tang. 2024. LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes.Proc. VLDB Endow.17, 8 (2024), 1925–1938. doi:10....

work page doi:10.14778/3659437.3659448 2024
[18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL] https://arxiv.org/abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2019
[19]

Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2023. DeepJoin: Joinable Table Discovery with Pre-Trained Language Models.Proc. VLDB Endow.16, 10 (June 2023), 2458–2470. doi:10.14778/3603581. 3603587

work page doi:10.14778/3603581 2023
[20]

Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée J. Miller. 2023. Semantics- aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning.Proc. VLDB Endow.16, 7 (2023), 1726–1739. doi:10. 14778/3587136.3587146

work page arXiv 2023
[21]

Raul Castro Fernandez, Ziawasch Abedjan, Samuel Madden, and Michael Stone- braker. 2016. Towards large-scale data discovery: position paper. InProceedings of the Third International Workshop on Exploratory Search in Databases and the Web(San Francisco, California)(ExploreDB ’16). Association for Computing Machinery, New York, NY, USA, 3–5. doi:10.1145/294...

work page doi:10.1145/2948674.2948675 2016
[22]

Raul Castro Fernandez, Jisoo Min, Demitri Nava, and Samuel Madden. 2019. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE, 1190–1201. doi:10.1109/ICDE.2019. 00109

work page doi:10.1109/icde.2019 2019
[23]

Yuxiang Guo, Zhonghao Hu, Yuren Mao, Baihua Zheng, Yunjun Gao, and Mingwei Zhou. 2025. BIRDIE: Natural Language-Driven Table Discovery Us- ing Differentiable Search Index.Proc. VLDB Endow.18, 7 (2025), 2070–2083. doi:10.14778/3734839.3734845

work page doi:10.14778/3734839.3734845 2025
[24]

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetrea...

work page doi:10.18653/v1/2020.acl-main.398 2020
[25]

Xuming Hu, Shen Wang, Xiao Qin, Chuan Lei, Zhengyuan Shen, Christos Falout- sos, Asterios Katsifodimos, George Karypis, Lijie Wen, and Philip S. Yu. 2023. Automatic Table Union Search with Tabular Representation Learning. InFind- ings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). As...

work page doi:10.18653/v1/2023.findings-acl.233 2023
[26]

Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satya- narayan, Tim Kraska, Çagatay Demiralp, and César Hidalgo. 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(Anchorage, AK, USA)(KDD ’19). Association for Comp...

work page doi:10.1145/3292500.3330993 2019
[27]

Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2024. CHORUS: Foundation Models for Unified Data Discovery and Exploration.Proc. VLDB Endow.17, 8 (2024), 2104–2114. doi:10.14778/3659437. 3659461

work page doi:10.14778/3659437 2024
[28]

Miller, and Mirek Riedewald

Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatter- bauer, Renée J. Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Semantic Table Union Search.Proc. ACM Manag. Data1, 1 (2023), 9:1–9:25. doi:10.1145/3588689

work page doi:10.1145/3588689 2023
[29]

Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, and Renée J. Miller
[30]

VLDB Endow.16, 4 (2022), 932–945

Integrating Data Lake Tables.Proc. VLDB Endow.16, 4 (2022), 932–945. doi:10.14778/3574245.3574274

work page doi:10.14778/3574245.3574274 2022
[31]

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised Contrastive Learning. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’...

2020
[32]

Christos Koutras, Kyriakos Psarakis, George Siachamis, Andra Ionescu, Marios Fragkoulis, Angela Bonifati, and Asterios Katsifodimos. 2021. Valentine in action: matching tabular data at scale.Proc. VLDB Endow.14, 12, 2871–2874. doi:10.14778/3476311.3476366

work page doi:10.14778/3476311.3476366 2021
[33]

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. 2022. Autoregressive Image Generation using Residual Quantization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 11513–11522. doi:10.1109/CVPR52688.2022.01123

work page doi:10.1109/cvpr52688.2022.01123 2022
[34]

Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships.Proc. VLDB Endow. 3, 1–2 (Sept. 2010), 1338–1347. doi:10.14778/1920841.1921005

work page doi:10.14778/1920841.1921005 2010
[35]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE Trans. Pattern Anal. Mach. Intell.42, 4 (2020), 824–836. doi:10.1109/TPAMI.2018. 2889473

work page doi:10.1109/tpami.2018 2020
[36]

Miller, Ken Q

Fatemeh Nargesian, Erkang Zhu, Renée J. Miller, Ken Q. Pu, and Patricia C. Arocena. 2019. Data Lake Management: Challenges and Opportunities.Proc. VLDB Endow.12, 12 (2019), 1986–1989. doi:10.14778/3352063.3352116

work page doi:10.14778/3352063.3352116 2019
[37]

Pu, and Renée J

Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table Union Search on Open Data.Proc. VLDB Endow.11, 7 (2018), 813–825. doi:10. 14778/3192965.3192973

work page arXiv 2018
[38]

Ermu Qiu, Jun Gao, Yaofeng Tu, and Jingru Yang. 2025. LIFTus: An Adaptive Multi-Aspect Column Representation Learning for Table Union Search. In41st IEEE International Conference on Data Engineering, ICDE 2025, Hong Kong, May 19-23, 2025. IEEE, 2174–2187. doi:10.1109/ICDE65448.2025.00165

work page doi:10.1109/icde65448.2025.00165 2025
[39]

Haohao Qu, Wenqi Fan, Zihuai Zhao, and Qing Li. 2025. TokenRec: Learning to Tokenize ID for LLM-Based Generative Recommendations.IEEE Trans. Knowl. Data Eng.37, 10 (2025), 6216–6231. doi:10.1109/TKDE.2025.3599265

work page doi:10.1109/tkde.2025.3599265 2025
[40]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res.21, 1, Article 140 (Jan. 2020), 67 pages

2020
[41]

Tran, Jonah Samost, Maciej Kula, Ed H

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Kesha- van, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Mahesh Sathiamoorthy. 2023. Rec- ommender Systems with Generative Retrieval. InAdvances in Neural Infor- mation Processing Systems 36: Annual Conference on Neural Information Pro- ...

2023
[42]

Yongkang Sun, Zhihao Ding, Huiqiang Wang, Reynold Cheng, and Jieming Shi
[43]

arXiv preprint arXiv:2603.17298(2026)

Efficient and Effective Table-Centric Table Union Search in Data Lakes. arXiv preprint arXiv:2603.17298(2026)

work page arXiv 2026
[44]

Cohen, and Donald Metzler

Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search Index. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur...

2022
[45]

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural Discrete Representation Learning. InAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 6306–6315. https://proceedings.neurips. cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fb...

2017
[46]

Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See- Kiong Ng, and Tat-Seng Chua. 2024. Learnable Item Tokenization for Generative Recommendation. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024, Boise, ID, USA, October 21-25, 2024, Edoardo Serra and Francesca Spezzano (Eds....

work page doi:10.1145/3627673.3679569 2024
[47]

Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, and Mao Yang. 2022. A Neural Cor- pus Indexer for Document Retrieval. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, ...

2022
[48]

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learn- ing as Beam-Search Optimization. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. Association for Computational Linguistics, 1296–1306. doi:10.18653/v1/d16-1137

work page doi:10.18653/v1/d16-1137 2016
[49]

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor nega- tive contrastive learning for dense text retrieval.arXiv preprint arXiv:2007.00808 (2020)

work page arXiv 2020
[50]

Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri
[51]

InProceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, K

InfoGather: entity augmentation and attribute discovery by holistic match- ing with web tables. InProceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fux- man (Eds.). ACM, 97–108. doi:10.1145/2213836.2213848

work page doi:10.1145/2213836.2213848 2012
[52]

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Associatio...

work page doi:10.18653/v1/2020.acl-main.745 2020
[53]

Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. 2024. Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations. 26 pages

2024
[54]

Dan Zhang, Madelon Hulsebos, Yoshihiko Suhara, Çağatay Demiralp, Jinfeng Li, and Wang-Chiew Tan. 2020. Sato: contextual semantic type detection in tables. Proc. VLDB Endow.13, 12 (July 2020), 1835–1848. doi:10.14778/3407790.3407793

work page doi:10.14778/3407790.3407793 2020
[55]

Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. InProceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang- Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Ed...

work page doi:10.1145/3318464.3389726 2020
[56]

Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. 2024. Adapting Large Language Models by Integrating Collab- orative Semantics for Recommendation. In40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024. IEEE, 1435–1448. doi:10.1109/ICDE60146.2024.00118

work page doi:10.1109/icde60146.2024.00118 2024
[57]

Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. InPro- ceedings of the 2019 International Conference on Management of Data(Amsterdam, Netherlands)(SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 847–864. doi:10.1145/3299869.3300065 14

work page doi:10.1145/3299869.3300065 2019

[1] [1]

Ziawasch Abedjan, Mahdi Esmailoghli, and Sainyam Galhotra. 2025. Data Dis- covery in Data Lakes: Operations, Indexes, Systems.Proc. VLDB Endow.18, 12 (Aug. 2025), 5455–5459. doi:10.14778/3750601.3750694

work page doi:10.14778/3750601.3750694 2025

[2] [2]

Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. 2023. A Survey on Table Representation Learning.ACM/IMS J. Data Sci.1, 1 (2023), 2:1–2:55. doi:10.1145/ 3589777

2023

[3] [3]

Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih, Sebastian Riedel, and Fabio Petroni. 2022. Autoregressive search engines: generating sub- strings as document identifiers. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’22). Curran Associates Inc., Red Hook, NY, USA...

2022

[4] [4]

Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, and Nikolaos Konstanti- nou. 2020. Dataset Discovery in Data Lakes. In36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE, 709–720. doi:10.1109/ICDE48307.2020.00067

work page doi:10.1109/icde48307.2020.00067 2020

[5] [5]

Dan Brickley, Matthew Burgess, and Natasha F. Noy. 2019. Google Dataset Search: Building a Search Engine for Datasets in an Open Web Ecosystem. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17,

2019

[6] [6]

doi:10.1145/3308558.3313685

ACM, 1365–1375. doi:10.1145/3308558.3313685

work page doi:10.1145/3308558.3313685

[7] [7]

Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang

Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang

[8] [8]

VLDB Endow

WebTables: exploring the power of tables on the web.Proc. VLDB Endow. 1, 1 (Aug. 2008), 538–549. doi:10.14778/1453856.1453916

work page doi:10.14778/1453856.1453916 2008

[9] [9]

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Au- toregressive Entity Retrieval. In9th International Conference on Learning Rep- resentations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=5k8F6UU39V

2021

[10] [10]

Sonia Castelo, Rémi Rampin, Aécio S. R. Santos, Aline Bessa, Fernando Chirigati, and Juliana Freire. 2021. Auctus: A Dataset Search Engine for Data Discovery and Augmentation.Proc. VLDB Endow.14, 12 (2021), 2791–2794. doi:10.14778/ 3476311.3476346

work page arXiv 2021

[11] [11]

Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A Data Discovery System. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). 1001–1012. doi:10.1109/ICDE.2018.00094

work page doi:10.1109/icde.2018.00094 2018

[12] [12]

Chengliang Chai, Yuhao Deng, Yutong Zhan, Ziqi Cao, Yuanfang Zhang, Lei Cao, Yu-Ping Wang, Zhiwei Zhang, Ye Yuan, Guoren Wang, and Nan Tang

[13] [13]

VLDB Endow.17, 12 (2024), 4381–4384

LakeCompass: An End-to-End System for Table Maintenance, Search and Analysis in Data Lakes.Proc. VLDB Endow.17, 12 (2024), 4381–4384. doi:10. 14778/3685800.3685880

work page arXiv 2024

[14] [14]

Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, and David Karger. 2020. ARDA: automatic relational data augmentation for machine learning.Proc. VLDB Endow.13, 9 (May 2020), 1373–1387. doi:10. 14778/3397230.3397235

work page arXiv 2020

[15] [15]

Tianji Cong, Fatemeh Nargesian, and H. V. Jagadish. 2023. Pylon: Semantic Table Union Search in Data Lakes.CoRRabs/2301.04901 (2023)

work page arXiv 2023

[16] [16]

Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2020. TURL: table understanding through representation learning.Proc. VLDB Endow.14, 3 (Nov. 2020), 307–319. doi:10.14778/3430915.3430921

work page doi:10.14778/3430915.3430921 2020

[17] [17]

Yuhao Deng, Chengliang Chai, Lei Cao, Qin Yuan, Siyuan Chen, Yanrui Yu, Zhaoze Sun, Junyi Wang, Jiajun Li, Ziqi Cao, Kaisen Jin, Chi Zhang, Yuqing Jiang, Yuanfang Zhang, Yuping Wang, Ye Yuan, Guoren Wang, and Nan Tang. 2024. LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes.Proc. VLDB Endow.17, 8 (2024), 1925–1938. doi:10....

work page doi:10.14778/3659437.3659448 2024

[18] [18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL] https://arxiv.org/abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2019

[19] [19]

Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, and Masafumi Oyamada. 2023. DeepJoin: Joinable Table Discovery with Pre-Trained Language Models.Proc. VLDB Endow.16, 10 (June 2023), 2458–2470. doi:10.14778/3603581. 3603587

work page doi:10.14778/3603581 2023

[20] [20]

Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée J. Miller. 2023. Semantics- aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning.Proc. VLDB Endow.16, 7 (2023), 1726–1739. doi:10. 14778/3587136.3587146

work page arXiv 2023

[21] [21]

Raul Castro Fernandez, Ziawasch Abedjan, Samuel Madden, and Michael Stone- braker. 2016. Towards large-scale data discovery: position paper. InProceedings of the Third International Workshop on Exploratory Search in Databases and the Web(San Francisco, California)(ExploreDB ’16). Association for Computing Machinery, New York, NY, USA, 3–5. doi:10.1145/294...

work page doi:10.1145/2948674.2948675 2016

[22] [22]

Raul Castro Fernandez, Jisoo Min, Demitri Nava, and Samuel Madden. 2019. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE, 1190–1201. doi:10.1109/ICDE.2019. 00109

work page doi:10.1109/icde.2019 2019

[23] [23]

Yuxiang Guo, Zhonghao Hu, Yuren Mao, Baihua Zheng, Yunjun Gao, and Mingwei Zhou. 2025. BIRDIE: Natural Language-Driven Table Discovery Us- ing Differentiable Search Index.Proc. VLDB Endow.18, 7 (2025), 2070–2083. doi:10.14778/3734839.3734845

work page doi:10.14778/3734839.3734845 2025

[24] [24]

Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetrea...

work page doi:10.18653/v1/2020.acl-main.398 2020

[25] [25]

Xuming Hu, Shen Wang, Xiao Qin, Chuan Lei, Zhengyuan Shen, Christos Falout- sos, Asterios Katsifodimos, George Karypis, Lijie Wen, and Philip S. Yu. 2023. Automatic Table Union Search with Tabular Representation Learning. InFind- ings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). As...

work page doi:10.18653/v1/2023.findings-acl.233 2023

[26] [26]

Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satya- narayan, Tim Kraska, Çagatay Demiralp, and César Hidalgo. 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(Anchorage, AK, USA)(KDD ’19). Association for Comp...

work page doi:10.1145/3292500.3330993 2019

[27] [27]

Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2024. CHORUS: Foundation Models for Unified Data Discovery and Exploration.Proc. VLDB Endow.17, 8 (2024), 2104–2114. doi:10.14778/3659437. 3659461

work page doi:10.14778/3659437 2024

[28] [28]

Miller, and Mirek Riedewald

Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatter- bauer, Renée J. Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Semantic Table Union Search.Proc. ACM Manag. Data1, 1 (2023), 9:1–9:25. doi:10.1145/3588689

work page doi:10.1145/3588689 2023

[29] [29]

Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, and Renée J. Miller

[30] [30]

VLDB Endow.16, 4 (2022), 932–945

Integrating Data Lake Tables.Proc. VLDB Endow.16, 4 (2022), 932–945. doi:10.14778/3574245.3574274

work page doi:10.14778/3574245.3574274 2022

[31] [31]

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised Contrastive Learning. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’...

2020

[32] [32]

Christos Koutras, Kyriakos Psarakis, George Siachamis, Andra Ionescu, Marios Fragkoulis, Angela Bonifati, and Asterios Katsifodimos. 2021. Valentine in action: matching tabular data at scale.Proc. VLDB Endow.14, 12, 2871–2874. doi:10.14778/3476311.3476366

work page doi:10.14778/3476311.3476366 2021

[33] [33]

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. 2022. Autoregressive Image Generation using Residual Quantization. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 11513–11522. doi:10.1109/CVPR52688.2022.01123

work page doi:10.1109/cvpr52688.2022.01123 2022

[34] [34]

Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships.Proc. VLDB Endow. 3, 1–2 (Sept. 2010), 1338–1347. doi:10.14778/1920841.1921005

work page doi:10.14778/1920841.1921005 2010

[35] [35]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE Trans. Pattern Anal. Mach. Intell.42, 4 (2020), 824–836. doi:10.1109/TPAMI.2018. 2889473

work page doi:10.1109/tpami.2018 2020

[36] [36]

Miller, Ken Q

Fatemeh Nargesian, Erkang Zhu, Renée J. Miller, Ken Q. Pu, and Patricia C. Arocena. 2019. Data Lake Management: Challenges and Opportunities.Proc. VLDB Endow.12, 12 (2019), 1986–1989. doi:10.14778/3352063.3352116

work page doi:10.14778/3352063.3352116 2019

[37] [37]

Pu, and Renée J

Fatemeh Nargesian, Erkang Zhu, Ken Q. Pu, and Renée J. Miller. 2018. Table Union Search on Open Data.Proc. VLDB Endow.11, 7 (2018), 813–825. doi:10. 14778/3192965.3192973

work page arXiv 2018

[38] [38]

Ermu Qiu, Jun Gao, Yaofeng Tu, and Jingru Yang. 2025. LIFTus: An Adaptive Multi-Aspect Column Representation Learning for Table Union Search. In41st IEEE International Conference on Data Engineering, ICDE 2025, Hong Kong, May 19-23, 2025. IEEE, 2174–2187. doi:10.1109/ICDE65448.2025.00165

work page doi:10.1109/icde65448.2025.00165 2025

[39] [39]

Haohao Qu, Wenqi Fan, Zihuai Zhao, and Qing Li. 2025. TokenRec: Learning to Tokenize ID for LLM-Based Generative Recommendations.IEEE Trans. Knowl. Data Eng.37, 10 (2025), 6216–6231. doi:10.1109/TKDE.2025.3599265

work page doi:10.1109/tkde.2025.3599265 2025

[40] [40]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res.21, 1, Article 140 (Jan. 2020), 67 pages

2020

[41] [41]

Tran, Jonah Samost, Maciej Kula, Ed H

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Kesha- van, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Mahesh Sathiamoorthy. 2023. Rec- ommender Systems with Generative Retrieval. InAdvances in Neural Infor- mation Processing Systems 36: Annual Conference on Neural Information Pro- ...

2023

[42] [42]

Yongkang Sun, Zhihao Ding, Huiqiang Wang, Reynold Cheng, and Jieming Shi

[43] [43]

arXiv preprint arXiv:2603.17298(2026)

Efficient and Effective Table-Centric Table Union Search in Data Lakes. arXiv preprint arXiv:2603.17298(2026)

work page arXiv 2026

[44] [44]

Cohen, and Donald Metzler

Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search Index. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, Neur...

2022

[45] [45]

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural Discrete Representation Learning. InAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 6306–6315. https://proceedings.neurips. cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fb...

2017

[46] [46]

Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See- Kiong Ng, and Tat-Seng Chua. 2024. Learnable Item Tokenization for Generative Recommendation. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM 2024, Boise, ID, USA, October 21-25, 2024, Edoardo Serra and Francesca Spezzano (Eds....

work page doi:10.1145/3627673.3679569 2024

[47] [47]

Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, and Mao Yang. 2022. A Neural Cor- pus Indexer for Document Retrieval. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, ...

2022

[48] [48]

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learn- ing as Beam-Search Optimization. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. Association for Computational Linguistics, 1296–1306. doi:10.18653/v1/d16-1137

work page doi:10.18653/v1/d16-1137 2016

[49] [49]

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor nega- tive contrastive learning for dense text retrieval.arXiv preprint arXiv:2007.00808 (2020)

work page arXiv 2020

[50] [50]

Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri

[51] [51]

InProceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, K

InfoGather: entity augmentation and attribute discovery by holistic match- ing with web tables. InProceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fux- man (Eds.). ACM, 97–108. doi:10.1145/2213836.2213848

work page doi:10.1145/2213836.2213848 2012

[52] [52]

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Associatio...

work page doi:10.18653/v1/2020.acl-main.745 2020

[53] [53]

Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. 2024. Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations. 26 pages

2024

[54] [54]

Dan Zhang, Madelon Hulsebos, Yoshihiko Suhara, Çağatay Demiralp, Jinfeng Li, and Wang-Chiew Tan. 2020. Sato: contextual semantic type detection in tables. Proc. VLDB Endow.13, 12 (July 2020), 1835–1848. doi:10.14778/3407790.3407793

work page doi:10.14778/3407790.3407793 2020

[55] [55]

Yi Zhang and Zachary G. Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. InProceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang- Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Ed...

work page doi:10.1145/3318464.3389726 2020

[56] [56]

Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. 2024. Adapting Large Language Models by Integrating Collab- orative Semantics for Recommendation. In40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, The Netherlands, May 13-16, 2024. IEEE, 1435–1448. doi:10.1109/ICDE60146.2024.00118

work page doi:10.1109/icde60146.2024.00118 2024

[57] [57]

Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. InPro- ceedings of the 2019 International Conference on Management of Data(Amsterdam, Netherlands)(SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 847–864. doi:10.1145/3299869.3300065 14

work page doi:10.1145/3299869.3300065 2019