pith. sign in

arxiv: 1907.03595 · v2 · pith:JQ5MABQZnew · submitted 2019-07-08 · 💻 cs.IR

Recommending Related Tables

Pith reviewed 2026-05-25 00:57 UTC · model grok-4.3

classification 💻 cs.IR
keywords related table recommendationtable matchingsemantic spacesdiscriminative learningWikipedia tablesinformation retrieval
0
0 comments X

The pith

Tables are recommended by embedding their elements in multiple semantic spaces and learning to combine the similarities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the task of recommending related tables to a given input table. The approach represents table elements in multiple semantic spaces and uses a discriminative learning model to combine element-level similarities for computing table similarity. It is evaluated on a test collection of Wikipedia tables where it achieves state-of-the-art performance. If correct, this enables applications like providing web-based related content recommendations within spreadsheet programs.

Core claim

The paper establishes a theoretically sound framework for table matching based on multi-space element representations combined via discriminative learning, which outperforms prior methods on Wikipedia table data.

What carries the argument

Representation of table elements in multiple semantic spaces combined using a discriminative learning model to compute table similarity.

If this is right

  • Proactive recommendations of related structured content can be provided to spreadsheet users.
  • Table similarity computation becomes more accurate by leveraging multiple semantic views.
  • Ranked lists of relevant tables can be generated effectively from large collections like Wikipedia.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might generalize to matching other structured data formats beyond tables.
  • Deployment in enterprise environments would require validating the method on non-Wikipedia data.
  • Future work could explore additional semantic spaces or different learning models for combination.

Load-bearing premise

The purpose-built test collection from Wikipedia tables is representative of real-world table recommendation scenarios.

What would settle it

Demonstrating that the method does not outperform baselines on a collection of enterprise spreadsheets would falsify the claim of state-of-the-art performance in practical settings.

Figures

Figures reproduced from arXiv: 1907.03595 by Krisztian Balog, Shuo Zhang.

Figure 1
Figure 1. Figure 1: ‡e task of related table recommendation is to re [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Representation of a table element Tx in the term and in a given semantic space y. 3.1 Element-Level Table Matching Framework We combine multiple table quality indicators and table similarity measures in a discriminative learning framework. Input and candi￾date table pairs are described as a feature vector, shown in Eq. (1). Œe main novelty lies in how table similarity is estimated. Instead of relying on ha… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of element-level similarity methods. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance di‚erence between InfoGather (base [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance in terms of NDCG with di‚erent [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance of CRAB-2 with respect to (relative) [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance analysis using only a portion of the [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Tables are an extremely powerful visual and interactive tool for structuring and manipulating data, making spreadsheet programs one of the most popular computer applications. In this paper we introduce and address the task of recommending related tables: given an input table, identifying and returning a ranked list of relevant tables. One of the many possible application scenarios for this task is to provide users of a spreadsheet program proactively with recommendations for related structured content on the Web. At its core, the related table recommendation task boils down to computing the similarity between a pair of tables. We develop a theoretically sound framework for performing table matching. Our approach hinges on the idea of representing table elements in multiple semantic spaces, and then combining element-level similarities using a discriminative learning model. Using a purpose-built test collection from Wikipedia tables, we demonstrate that the proposed approach delivers state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the task of recommending related tables to a given input table, motivated by applications such as proactive recommendations in spreadsheet programs. It develops a framework for table matching that represents table elements in multiple semantic spaces and combines element-level similarities via a discriminative learning model. Using a purpose-built test collection derived from Wikipedia tables, the approach is shown to achieve state-of-the-art performance.

Significance. If the results hold, the multi-semantic-space representation offers a principled way to capture different facets of table similarity, which could benefit structured data recommendation systems. The construction of a purpose-built test collection from Wikipedia tables is a positive contribution that enables future work on this task.

major comments (1)
  1. [Experiments] Experiments section: The SOTA claim and applicability to the scenarios in the introduction (proactive spreadsheet recommendations, web structured content) rest on results from the purpose-built Wikipedia test collection, but no details are given on collection construction, relevance judgment protocol, inter-annotator agreement, or any cross-domain validation. This is load-bearing because the collection's element distributions, schema variability, and relevance criteria may not match enterprise spreadsheets or user-generated content.
minor comments (1)
  1. The abstract would benefit from specifying the evaluation metrics (e.g., MAP or NDCG) used to establish state-of-the-art performance.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. The primary concern raised is the level of detail provided on the test collection and its implications for the SOTA claims and applicability. We address this point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Experiments section: The SOTA claim and applicability to the scenarios in the introduction (proactive spreadsheet recommendations, web structured content) rest on results from the purpose-built Wikipedia test collection, but no details are given on collection construction, relevance judgment protocol, inter-annotator agreement, or any cross-domain validation. This is load-bearing because the collection's element distributions, schema variability, and relevance criteria may not match enterprise spreadsheets or user-generated content.

    Authors: We agree that expanded details on the test collection are warranted to support the claims. Section 4 describes the Wikipedia table sampling process and pairing strategy, but the relevance judgment protocol, inter-annotator agreement statistics, and explicit discussion of schema variability were not elaborated sufficiently. In the revised version we will add a dedicated subsection detailing the judgment guidelines, report agreement measures, and include a limitations paragraph addressing differences from enterprise spreadsheets and user-generated content. We maintain that the collection serves as a valid proxy for web structured data (consistent with prior table corpora), but acknowledge the absence of cross-domain experiments and will frame the results accordingly without overgeneralizing applicability. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is self-contained empirical framework

full rationale

The paper introduces a table matching framework based on multi-semantic-space element representations combined via a discriminative learning model, then reports empirical SOTA results on a purpose-built Wikipedia test collection. No equations, parameter-fitting procedures, or self-citations are visible that would reduce the claimed similarity computation or performance result to the inputs by construction. The central claim rests on standard representation and supervised combination techniques evaluated externally on held-out data rather than any self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is provided; no free parameters, axioms, or invented entities can be identified from the given text.

pith-pipeline@v0.9.0 · 5659 in / 1059 out tokens · 17368 ms · 2026-05-25T00:57:04.834054+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

  1. [1]

    Ahmad Ahmadov, Maik /T_hiele, Julian Eberius, Wolfgang Lehner, and Robert Wrembel. 2015. Towards a Hybrid Imputation Approach Using Web Tables.. In Proc. of BDC ’15 . 21–30

  2. [2]

    Anonymous. 2017. Removed to Protect Anonymity. (2017)

  3. [3]

    Halevy, Boulos Harb, Hongrae Lee, Jayant Mad- havan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu

    Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Mad- havan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu. 2015. Applying WebTables in Practice. In Proc. of CIDR ’15

  4. [4]

    Somnath Banerjee, Soumen Chakrabarti, and Ganesh Ramakrishnan. 2009. Learn- ing to Rank for /Q_uantity Consensus /Q_ueries. InProc. of SIGIR ’09 . 243–250

  5. [5]

    Chandra Sekhar Bhagavatula, /T_hanapon Noraset, and Doug Downey. 2013. Meth- ods for Exploring and Mining Tables on Wikipedia. In Proc. of IDEA ’13 . 18–26

  6. [6]

    Chandra Sekhar Bhagavatula, /T_hanapon Noraset, and Doug Downey. 2015. TabEL: Entity Linking in Web Tables. InProc. of ISWC 2015. 425–441

  7. [7]

    Cafarella, Alon Halevy, and Nodira Khoussainova

    Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data Integra- tion for the Relational Web. Proc. of VLDB Endow. 2 (2009), 1090–1101

  8. [8]

    Cafarella, Alon Halevy, and Jayant Madhavan

    Michael J. Cafarella, Alon Halevy, and Jayant Madhavan. 2011. Structured Data on the Web. Commun. ACM 54 (2011), 72–79

  9. [9]

    Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang

    Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang

  10. [10]

    WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. 1 (2008), 538–549

  11. [11]

    Fernando Chirigati, Jialu Liu, Flip Korn, You (Will) Wu, Cong Yu, and Hao Zhang

  12. [12]

    Knowledge Exploration Using Tables on the Web. Proc. of VLDB Endow. 10 (2016), 193–204

  13. [13]

    Eric Crestan and Patrick Pantel. 2011. Web-scale Table Census and Classi/f_ication. In Proc. of WSDM ’11 . 545–554

  14. [14]

    Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. 2012. Finding Related Tables. In Proc. of SIGMOD ’12 . 817–828

  15. [15]

    Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, /T_homas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. InProc. of KDD ’14. 601–610

  16. [16]

    Fleiss et al

    J.L. Fleiss et al. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (1971), 378–382

  17. [17]

    Faegheh Hasibi, Krisztian Balog, Dar ´ıo Gariglio/t_ti, and Shuo Zhang. 2017. Nordlys: A Toolkit for Entity-Oriented and Semantic Search. In Proc. of SIGIR ’17. 1289–1292

  18. [18]

    Yusra Ibrahim, Mirek Riedewald, and Gerhard Weikum. 2016. Making Sense of Entities and /Q_uantities in Web Tables. InProc. of CIKM ’16 . 1703–1712

  19. [19]

    Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A Large Public Corpus of Web Tables Containing Time and Context Metadata. In Proc. of WWW ’16 Companion . 75–76

  20. [20]

    Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paul- heim, and Christian Bizer. 2015. /T_he Mannheim Search Join Engine.Web Semant. 35 (2015), 159–166

  21. [21]

    Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and Searching Web Tables Using Entities, Types and Relationships. Proc. of VLDB Endow. 3 (2010), 1338–1347

  22. [22]

    Craig Macdonald, Rodrygo L T Santos, and Iadh Ounis. 2012. On the Usefulness of /Q_uery Features for Learning to Rank. InProc. of CIKM ’12 . 2559–2562

  23. [23]

    Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Y. Halevy

  24. [24]

    Harnessing the Deep Web: Present and Future

    Harnessing the Deep Web: Present and Future. CoRR abs/0909.1785 (2009)

  25. [25]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and /T_heir Compositionality. In Proc. of NIPS ’13 . 3111–3119

  26. [26]

    Emir Munoz, Aidan Hogan, and Alessandra Mileo. 2014. Using Linked Data to Mine RDF from Wikipedia’s Tables. In Proc. of WSDM ’14 . 533–542

  27. [27]

    Neural Programmer: Inducing Latent Programs with Gradient Descent

    Arvind Neelakantan, /Q_uoc V. Le, and Ilya Sutskever. 2015. Neural Programmer: Inducing Latent Programs with Gradient Descent. CoRR abs/1511.04834 (2015)

  28. [28]

    /T_hanh Tam Nguyen, /Q_uoc Viet Hung Nguyen, Weidlich Ma/t_thias, and Aberer Karl. 2015. Result Selection and Summarization for Web Table Search. In ISDE ’15. 425–441

  29. [29]

    Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global Vectors for Word Representation. InProc. of EMNLP ’14 . 1532–1543

  30. [30]

    Rakesh Pimplikar and Sunita Sarawagi. 2012. Answering Table /Q_ueries on the Web Using Column Keywords. Proc. of VLDB Endow. 5 (2012), 908–919

  31. [31]

    Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval. Inf. Retr. 13, 4 (Aug 2010), 346–374

  32. [32]

    Petar Ristoski and Heiko Paulheim. 2016. RDF2vec: RDF Graph Embeddings for Data Mining. In Proc. of ISWC ’16. 498–514

  33. [33]

    Dominique Ritze, Oliver Lehmberg, Yaser Oulabi, and Christian Bizer. 2016. Pro/f_iling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases. In Proc. of WWW ’16 . 251–261

  34. [34]

    Sunita Sarawagi and Soumen Chakrabarti. 2014. Open-domain /Q_uantity /Q_ueries on Web Tables: Annotation, Response, and Consensus Models. In Proc. of KDD ’14. 711–720

  35. [35]

    Sekhavat, Francesco Di Paolo, Denilson Barbosa, and Paolo Merialdo

    Yoones A. Sekhavat, Francesco Di Paolo, Denilson Barbosa, and Paolo Merialdo

  36. [36]

    Knowledge Base Augmentation using Tabular Data. In Proc. of LDOW ’14

  37. [37]

    Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity Linking with a Knowl- edge Base: Issues, Techniques, and Solutions. IEEE Trans. Knowl. Data Eng. 27, 2 (feb 2015), 443–460

  38. [38]

    Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. 2005. Early Versus Late Fusion in Semantic Video Analysis. In Proc. of MULTIMEDIA ’05 . 399–402

  39. [39]

    Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Pas ¸ca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. 2011. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. 4 (2011), 528–538

  40. [40]

    Jiannan Wang, Guoliang Li, and Jianhua Fe. 2011. Fast-join: An Efficient Method for Fuzzy Token Matching Based String Similarity Join. In Proc. of ICDE ’11 . 458–469

  41. [41]

    Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri

  42. [42]

    InfoGather: Entity Augmentation and A/t_tribute Discovery by Holistic Matching with Web Tables. In Proc. of SIGMOD ’12 . 97–108

  43. [43]

    Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2016. Neural Enquirer: Learning to /Q_uery Tables in Natural Language. InProc. of IJCAI ’16 . 2308–2314

  44. [44]

    Meihui Zhang and Kaushik Chakrabarti. 2013. InfoGather+: Semantic Matching and Annotation of Numeric and Time-varying A/t_tributes in Web Tables. InProc. of SIGMOD ’13. 145–156

  45. [45]

    Shuo Zhang and Krisztian Balog. 2017. Design Pa/t_terns for Fusion-Based Ob- ject Retrieval. In Proceedings of the 39th European conference on Advances in Information Retrieval (ECIR ’17) . Springer, 684–690

  46. [46]

    Shuo Zhang and Krisztian Balog. 2017. EntiTables: Smart Assistance for Entity- Focused Tables. In Proc. of SIGIR ’17 . 255–264

  47. [47]

    Shuo Zhang and Krisztian Balog. 2018. Ad Hoc Table Retrieval Using Semantic Similarity. In Proceedings of /T_he Web Conference (WWW ’18). 1553–1562

  48. [48]

    Shuo Zhang and Krisztian Balog. 2018. On-the-/f_ly Table Generation. InProceed- ings of 41st International ACM SIGIR Conference on Research and Development in Information Retrieval

  49. [49]

    Stefan Zwicklbauer, Christoph Einsiedler, Michael Granitzer, and Christin Seifert

  50. [50]

    Towards Disambiguating Web Tables. In Proc. of ISWC-PD’ 13. 205–208