pith. sign in

arxiv: 2605.30729 · v1 · pith:HI4WZJMHnew · submitted 2026-05-29 · 💻 cs.LG · cs.IR

SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching

Pith reviewed 2026-06-28 23:27 UTC · model grok-4.3

classification 💻 cs.LG cs.IR
keywords schema matchinggraph neural networkspre-trained language modelsheterogeneous graphstabular data integrationstructural embeddings
0
0 comments X

The pith

SemStruct improves schema matching by feeding row co-occurrence graphs to a GNN while keeping the language model frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tables contain both column semantics and the relational patterns that appear across rows. Standard language models serialize columns as text and lose those patterns. SemStruct builds a heterogeneous graph with columns and values as nodes linked by rows, then runs a graph neural network over it to supply the missing context. Only the graph component is trained; the language model stays frozen. On the Valentine and SOTAB-SM benchmarks this produces state-of-the-art results, especially on datasets where columns are semantically joinable but ambiguous from headers alone.

Core claim

By representing each table as a heterogeneous graph in which columns and values are nodes and rows supply the edges, a graph neural network can propagate disambiguating relational context into the embeddings produced by a frozen pre-trained language model. The resulting combined representation yields higher matching accuracy than fully fine-tuned language-model baselines on complex schema-matching tasks, while requiring far less training compute.

What carries the argument

Heterogeneous graph of columns and values connected by rows, processed by a GNN whose output augments frozen PLM embeddings.

If this is right

  • Schema matching no longer requires fine-tuning or proprietary access to large language models.
  • Only a lightweight graph encoder needs training, lowering compute and data requirements.
  • Row representations function mainly as topological conduits rather than independent semantic carriers.
  • Performance gains concentrate on datasets whose columns are semantically joinable but header-ambiguous.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-augmented pattern could be tested on other tabular tasks such as entity resolution or data imputation.
  • Freezing the language model may improve robustness when tables come from domains unseen during pre-training.
  • If row topology is the main carrier of value, simpler graph constructions might suffice for some datasets.

Load-bearing premise

Row-level co-occurrences captured in the graph supply disambiguating context that a GNN can usefully propagate even when the language model is not updated.

What would settle it

A controlled test in which the same frozen PLM embeddings are paired with a randomized or edge-removed graph; if accuracy falls to the level of the frozen PLM alone, the structural contribution is confirmed.

Figures

Figures reproduced from arXiv: 2605.30729 by Horst Samulowitz, Inwon Kang, Kavitha Srinivas, Nandana Mihindukulasooriya, Oshani Seneviratne, Parikshit Ram, Sola Shirai.

Figure 1
Figure 1. Figure 1: Example of a scenario where row-context matters. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visual description of the SemStruct inference on a single table. 1. The column node and cell node features are [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of different methods on the SOTAB [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: MRR and Recall@GT comparison of different meth [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance evolution as we progressively add [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of row node initialization strategies. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of different serialization strategies. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of different PLM backbones. Lower [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of different node-merging strategies. [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
read the original abstract

Schema matching is a fundamental step in integrating heterogeneous data sources. While Pre-trained Language Models (PLMs) have revolutionized this task by capturing linguistic semantics, they typically process tabular data as serialized text sequences of standalone column descriptions. This serialization discards critical structural information -- specifically, the row-level co-occurrences, i.e. the relational context -- forcing models to rely solely on column header semantics or standalone distributions. To bridge this gap, we propose SemStruct, a framework that joins the semantic power of frozen PLMs with the structural inductive bias of Graph Neural Networks (GNNs). We model the table as a heterogeneous graph where columns and values are nodes connected by rows, allowing the GNN to propagate disambiguating context across the structure. Unlike other state-of-the-art methods that require proprietary LLM access and fine-tuning of language models, SemStruct keeps the language model frozen and trains only a lightweight structural encoder. Extensive experiments on the Valentine and SOTAB-SM benchmarks demonstrate that SemStruct achieves state-of-the-art performance, outperforming fully fine-tuned baselines on complex, semantically joinable datasets. Furthermore, our ablation studies reveal that row representations serve primarily as topological conduits rather than semantic entities, validating the necessity of explicit structural modeling in schema matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SemStruct, a schema matching framework that combines frozen PLMs with a lightweight GNN operating on heterogeneous graphs (columns and values as nodes, rows as edges) to inject row-level co-occurrence context. It claims SOTA results on Valentine and SOTAB-SM benchmarks, outperforming fully fine-tuned PLM baselines on complex joinable datasets, while ablation studies indicate that row representations function primarily as topological conduits rather than semantic entities.

Significance. If the reported gains are attributable to the structural inductive bias rather than confounding factors in training or aggregation, the result would be significant for schema matching: it would demonstrate that explicit relational modeling via GNNs on row-induced paths can deliver superior performance without fine-tuning large PLMs, offering a computationally lighter alternative for data integration tasks. The frozen-PLM design and focus on structural context are positive aspects that align with efficiency goals in the field.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (Experiments): The SOTA claim and outperformance over fully fine-tuned baselines are presented without specifying the exact baselines used, the evaluation metrics (e.g., precision@K, F1, or AUC), dataset splits, negative sampling procedure, or column-embedding aggregation method. These omissions make it impossible to assess whether the reported deltas are driven by the heterogeneous graph and GNN propagation or by differences in training regime, rendering the central empirical claim unevaluable from the provided information.
  2. [§5] §5 (Ablations): The assertion that 'row representations serve primarily as topological conduits' is load-bearing for the structural-inductive-bias argument, yet the ablation results do not quantify the fraction of the performance gain versus fine-tuned baselines that is attributable to row-induced paths in the GNN versus other modeling choices (e.g., the lightweight encoder architecture or embedding fusion). Without such isolation, the weakest assumption identified in the skeptic note remains unaddressed.
minor comments (2)
  1. [§3] Notation for the heterogeneous graph (nodes, edge types, message-passing equations) should be introduced with explicit definitions in §3 to avoid ambiguity when describing how row co-occurrences are encoded.
  2. [§3] The paper would benefit from a clear statement of the PLM backbone (e.g., BERT-base or RoBERTa) and the precise GNN variant (e.g., R-GCN, GAT) used in the structural encoder.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental clarity and ablation strength. We address each major comment below, providing clarifications from the manuscript where available and committing to revisions for improved evaluability.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The SOTA claim and outperformance over fully fine-tuned baselines are presented without specifying the exact baselines used, the evaluation metrics (e.g., precision@K, F1, or AUC), dataset splits, negative sampling procedure, or column-embedding aggregation method. These omissions make it impossible to assess whether the reported deltas are driven by the heterogeneous graph and GNN propagation or by differences in training regime, rendering the central empirical claim unevaluable from the provided information.

    Authors: Section 4 details the experimental protocol: baselines are the fine-tuned PLM variants (BERT-base, RoBERTa, Sentence-BERT) trained end-to-end on the schema matching objective as described in the Valentine and SOTAB-SM papers; metrics are F1 and AUC following the benchmark conventions; splits use the official Valentine 70/15/15 partitions and the SOTAB-SM predefined splits; negative sampling pairs each column with 5 randomly selected non-matching columns from the same table; column embeddings are aggregated by mean pooling prior to GNN input. The abstract is intentionally concise and does not repeat these, which may have caused the impression of omission. We will revise the abstract to explicitly name the primary metric (F1) and reference §4 for the full setup. This addresses evaluability without changing any reported numbers. revision: yes

  2. Referee: [§5] §5 (Ablations): The assertion that 'row representations serve primarily as topological conduits' is load-bearing for the structural-inductive-bias argument, yet the ablation results do not quantify the fraction of the performance gain versus fine-tuned baselines that is attributable to row-induced paths in the GNN versus other modeling choices (e.g., the lightweight encoder architecture or embedding fusion). Without such isolation, the weakest assumption identified in the skeptic note remains unaddressed.

    Authors: The current ablations in §5 demonstrate that removing row edges drops performance to near the frozen-PLM baseline, while ablating the lightweight GNN encoder yields smaller losses. However, we acknowledge the value of an explicit decomposition of the gain over fine-tuned PLMs. We will add a new row to the ablation table that reports the incremental F1 improvement attributable to row-path propagation (computed as the difference between full SemStruct and a row-edge-ablated variant, expressed as a percentage of the total gain versus the strongest fine-tuned baseline). This will be accompanied by a short paragraph quantifying the contribution of the structural component versus architectural choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivation chain or self-referential reductions

full rationale

The paper is an empirical modeling contribution that proposes SemStruct as a combination of a frozen PLM with a lightweight GNN on a heterogeneous graph (columns/values as nodes, rows as edges). No equations, fitted parameters, or mathematical derivations are presented in the provided text. There are no self-citations invoked to justify uniqueness, no ansatzes smuggled via prior work, and no renaming of known results as new derivations. Performance claims rest on benchmark experiments (Valentine, SOTAB-SM) rather than any reduction of outputs to inputs by construction. The central assumption about row-level co-occurrences providing disambiguating context is testable via ablation but does not create a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the approach relies on standard GNN and PLM components whose details are not provided.

pith-pipeline@v0.9.1-grok · 5783 in / 1123 out tokens · 25665 ms · 2026-06-28T23:27:39.583359+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 16 canonical work pages · 5 internal anchors

  1. [1]

    David Aumueller, Hong-Hai Do, Sabine Massmann, and Erhard Rahm. 2005. Schema and Ontology Matching with COMA++. InProceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD ’05). Asso- ciation for Computing Machinery, New York, NY, USA, 906–908. doi:10.1145/ 1066157.1066283

  2. [2]

    Philip A Bernstein, Jayant Madhavan, and Erhard Rahm. 2011. Generic schema matching, ten years later.Proceedings of the VLDB Endowment4, 11 (2011), 695–701

  3. [3]

    C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen

    Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen. 2015. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data

  4. [4]

    Mark Davies, Michał Nowotka, George Papadatos, Nathan Dedman, Anna Gaulton, Francis Atkinson, Louisa Bellis, and John P Overington. 2015. ChEMBL web services: streamlining access to drug discovery data and utilities.Nucleic acids research43, W1 (2015), W612–W620

  5. [5]

    Xingyu Du, Gongsheng Yuan, Sai Wu, Gang Chen, and Peng Lu. 2024. In Situ Neural Relational Schema Matcher. In2024 IEEE 40th International Conference on Data Engineering (ICDE). 138–150. doi:10.1109/ICDE60146.2024.00018

  6. [6]

    Kawin Ethayarajh. 2019. How Contextual Are Contextualized Word Represen- tations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent...

  7. [7]

    Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée J Miller. 2023. Semantics- Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning.Proceedings of the VLDB Endowment16, 7 (2023), 1726– 1739

  8. [8]

    Matthias Fey, Weihua Hu, Kexin Huang, Jan Eric Lenssen, Rishabh Ranjan, Joshua Robinson, Rex Ying, Jiaxuan You, and Jure Leskovec. 2023. Relational deep learning: Graph representation learning on relational databases.arXiv preprint arXiv:2312.04615(2023)

  9. [9]

    Matthias Fey, Jinu Sunil, Akihiro Nitta, Rishi Puri, Manan Shah, Blaž Stojanovič, Ramona Bendias, Alexandria Barghi, Vid Kocijan, Zecheng Zhang, Xinwei He, Jan Eric Lenssen, and Jure Leskovec. 2025. PyG 2.0: Scalable Learning on Real World Graphs. InTemporal Graph Learning Workshop @ KDD

  10. [10]

    Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs.Advances in Neural Information Processing Systems30 (2017)

  11. [11]

    Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre- training. InProceedings of the 58th annual meeting of the association for computa- tional linguistics. 4320–4333

  12. [12]

    Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous Graph Transformer. InProceedings of The Web Conference 2020 (WWW ’20). Association for Computing Machinery, New York, NY, USA, 2704–2710. doi:10. 1145/3366423.3380027

  13. [13]

    Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J Miller, and Mirek Riedewald. 2023. Santos: Relationship-based semantic table union search.Proceedings of the ACM on Management of Data1, 1 (2023), 1–25

  14. [14]

    Keti Korini, Ralph Peeters, and Christian Bizer. 2022. SOTAB: The WDC Schema.org table annotation benchmark. InCEUR Workshop Proceedings, Vasilis Efthymiou, Ernesto Jiménez-Ruiz, Jiaoyan Chen, Vincenzo Cutrona, Oktie Has- sanzadeh, Juan Sequeda, Kavitha Srinivas, Nora Abdelmageed, and Madelon Hulsebos (Eds.), Vol. 3320. RWTH Aachen, Aachen, Germany, 14–19

  15. [15]

    Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, and Christoph Lofi

  16. [16]

    InEdbt/Icdt Workshops

    REMA: Graph Embeddings-based Relational Schema Matching. InEdbt/Icdt Workshops. 17

  17. [17]

    Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsi- fodimos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In2021 IEEE 37th International Conference on Data Engineering (ICDE). 468–479. doi:10.1109/ICDE51399.2021.00047

  18. [18]

    Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan

  19. [19]

    Deep entity matching with pre-trained language models.Proceedings of the VLDB Endowment14, 1 (2020), 50–60

  20. [20]

    Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian- Guang Lou. 2021. TAPEX: Table Pre-training via Learning a Neural SQL Executor. InInternational Conference on Learning Representations. https://openreview.net/ forum?id=O50443AsCP

  21. [21]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.arXiv preprint arXiv:1907.11692 (2019)

  22. [22]

    Yurong Liu, Eduardo H. M. Pena, Aécio Santos, Eden Wu, and Juliana Freire. 2025. Magneto: Combining Small and Large Language Models for Schema Matching. Proceedings of the VLDB Endowment18, 8 (2025), 2681–2694. doi:10.14778/3742728. 3742757

  23. [23]

    Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. 2001. Generic schema matching with cupid. Invldb, Vol. 1. 49–58

  24. [24]

    Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. 2002. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings 18th international conference on data engineering. IEEE, 117–128

  25. [25]

    Fatemeh Nargesian, Erkang Zhu, Ken Q Pu, and Renée J Miller. 2018. Table union search on open data.Proceedings of the VLDB Endowment11, 7 (2018), 813–825

  26. [26]

    Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, and Brian Caufield. 2014. TPC- DI: the first industry benchmark for data integration.Proceedings of the VLDB Endowment7, 13 (2014), 1367–1378

  27. [27]

    Erhard Rahm and Philip A Bernstein. 2001. A survey of approaches to automatic schema matching.the VLDB Journal10, 4 (2001), 334–350

  28. [28]

    Chiz Der Reng and Christian Bizer. [n. d.]. WDC Schema Matching Benchmark (SMB). https://webdatacommons.org/structureddata/smb/

  29. [29]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. Distil- BERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter.arXiv KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Kang et al. preprint arXiv:1910.01108(2020)

  30. [30]

    Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. InEuropean semantic web conference. Springer, 593–607

  31. [31]

    Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding.arXiv preprint arXiv:2004.09297(2020)

  32. [32]

    Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study. InProceedings of the 17th ACM International Con- ference on Web Search and Data Mining (WSDM ’24). Association for Computing Machinery, New York, NY, USA, 645–654. doi:10.1145/...

  33. [33]

    Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks.arXiv preprint arXiv:1710.10903(2018)

  34. [34]

    Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase.Commun. ACM57, 10 (2014), 78–85

  35. [35]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Text Embeddings by Weakly-Supervised Contrastive Pre-training.arXiv preprint arXiv:2212.03533(2024)

  36. [36]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou

  37. [37]

    MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers.arXiv preprint arXiv:2002.10957(2020)

  38. [38]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transforme...

  39. [39]

    Qitian Wu, Chenxiao Yang, and Junchi Yan. 2021. Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach.Advances in Neural Information Processing Systems34 (2021), 19435–19447

  40. [40]

    Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. InPro- ceedings of the 58th Annual Meeting of the Association for Computational Linguis- tics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Associa- tion for Computational Linguistics, On...

  41. [41]

    Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. 2019. Heterogeneous graph neural network. InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 793–803

  42. [42]

    Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc, and Divesh Srivastava. 2011. Automatic discovery of attributes in relational databases. InProceedings of the 2011 ACM SIGMOD International Conference on Management of data. 109–120. A Appendix A.1 Experimental Setup A.1.1 Hardware & Libraries.We utilize PyTorch Geometric [ 9] for g...