ConStruM: A Structure-Guided LLM Framework for Context-Aware Schema Matching
Pith reviewed 2026-05-16 10:11 UTC · model grok-4.3
The pith
ConStruM improves schema matching by assembling small context packs from a tree and hypergraph to ground LLM decisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConStruM constructs a context tree for multi-level budgeted retrieval and a similarity hypergraph to surface similar column groups with group-aware cues, then assembles query-specific context packs that augment LLM prompts to produce better-grounded final matching selections.
What carries the argument
A context tree for budgeted multi-level context retrieval paired with a global similarity hypergraph that surfaces similar column groups and computes differentiation cues to enable compact evidence packing.
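A rough way to picture budgeted multi-level retrieval over a context tree is sketched below. This is not the paper's algorithm: the `Node` class, the whitespace token count in `token_cost`, and the leaf-to-root packing order in `retrieve` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical context-tree node: a column, table, or schema summary."""
    name: str
    summary: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def token_cost(text: str) -> int:
    # Assumption: whitespace tokens stand in for a real tokenizer.
    return len(text.split())


def retrieve(leaf: Node, budget: int) -> List[str]:
    """Walk from a column up to the root, keeping summaries that fit the budget."""
    pack: List[str] = []
    node: Optional[Node] = leaf
    while node is not None:
        cost = token_cost(node.summary)
        if cost <= budget:
            pack.append(f"{node.name}: {node.summary}")
            budget -= cost
        node = node.parent
    return pack


# Toy three-level tree: schema -> table -> column (all names are made up).
schema = Node("hospital_db", "clinical records for inpatient stays")
table = schema.add(Node("admissions", "one row per hospital admission"))
column = table.add(Node("admittime", "timestamp when the patient was admitted"))

pack = retrieve(column, budget=12)
```

With a budget of 12 whitespace tokens, the column and table summaries fit and the schema summary is skipped, which illustrates how a budget forces a choice among levels.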
Load-bearing premise
A small budgeted context pack drawn from the context tree and hypergraph contains enough discriminative evidence to improve matching decisions.
What would settle it
Measure matching accuracy on the same real datasets with the ConStruM context pack versus an empty or random pack; if the ConStruM pack shows no accuracy gain over those controls, the central claim is falsified.
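The comparison described above could be wired up roughly as follows. Everything here is a hypothetical harness: `toy_matcher` stands in for the LLM call, `structured_pack` stands in for the method under test, and the two cases are invented data, not results from the paper.

```python
import random
from typing import Callable, Dict, List, Sequence

# A matcher maps (source column, candidate targets, context pack) to a chosen target.
Matcher = Callable[[str, Sequence[str], Sequence[str]], str]
PackFn = Callable[[Dict], List[str]]


def accuracy(matcher: Matcher, cases: List[Dict], pack_for: PackFn) -> float:
    """Fraction of cases where the matcher picks the gold target."""
    hits = sum(
        matcher(c["source"], c["candidates"], pack_for(c)) == c["gold"]
        for c in cases
    )
    return hits / len(cases)


def empty_pack(case: Dict) -> List[str]:
    return []


def make_random_pack(pool: List[str], k: int, seed: int = 0) -> PackFn:
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return lambda case: rng.sample(pool, k)


def structured_pack(case: Dict) -> List[str]:
    # Placeholder for the method under test; here it returns stored evidence.
    return case["evidence"]


def toy_matcher(source: str, candidates: Sequence[str], pack: Sequence[str]) -> str:
    # Stand-in for the LLM: pick the first candidate named in the pack,
    # otherwise fall back to the first candidate on the shortlist.
    for cand in candidates:
        if any(cand in line for line in pack):
            return cand
    return candidates[0]


cases = [
    {"source": "dob", "candidates": ["death_date", "birth_date"], "gold": "birth_date",
     "evidence": ["birth_date: date the person was born"]},
    {"source": "adm", "candidates": ["admit_time", "discharge_time"], "gold": "admit_time",
     "evidence": ["admit_time: when the stay began"]},
]
pool = ["unrelated: free-text note", "other: row identifier"]

results = {
    "empty": accuracy(toy_matcher, cases, empty_pack),
    "random": accuracy(toy_matcher, cases, make_random_pack(pool, k=1)),
    "structured": accuracy(toy_matcher, cases, structured_pack),
}
```

The point of the design is that all three conditions share the same matcher and cases, so any difference in `results` can only come from the pack.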
Original abstract
Column matching is a central task in reconciling schemas for data integration. Column names and descriptions are valuable for this task. LLMs can leverage such natural-language schema metadata. However, in many datasets, correct matching requires additional evidence beyond the column itself. Because it is impractical to provide an LLM with the entire schema metadata needed to capture this evidence, the core challenge becomes to select and organize the most useful contextual information. We present ConStruM, a structure-guided framework for budgeted evidence packing in schema matching. ConStruM constructs a lightweight, reusable structure in which, at query time, it assembles a small context pack emphasizing the most discriminative evidence. ConStruM is designed as an add-on: given a shortlist of candidate targets produced by an upstream matcher, it augments the matcher's final LLM prompt with structured, query-specific evidence so that the final selection is better grounded. For this purpose, we develop a context tree for budgeted multi-level context retrieval and a global similarity hypergraph that surfaces groups of highly similar columns (on both the source and target sides), summarized via group-aware differentiation cues computed online or precomputed offline. Experiments on real datasets show that ConStruM improves matching by providing and organizing the right contextual evidence.
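The add-on design in the abstract, where a context pack is appended to the upstream matcher's final prompt, might look like the sketch below. The prompt wording and the `build_prompt` helper are assumptions; the paper's actual prompt template is not given in this text.

```python
from typing import List


def build_prompt(source: str, shortlist: List[str], pack: List[str]) -> str:
    """Append a context pack to a final-selection prompt over a given shortlist."""
    lines = [
        f"Source column: {source}",
        "Candidate targets:",
        *[f"  {i}. {c}" for i, c in enumerate(shortlist, start=1)],
    ]
    if pack:  # the add-on only changes the prompt when it has evidence to offer
        lines.append("Context evidence:")
        lines.extend(f"  - {e}" for e in pack)
    lines.append("Answer with the number of the best-matching target.")
    return "\n".join(lines)


# Hypothetical shortlist from an upstream matcher, plus one piece of evidence.
plain = build_prompt("dob", ["death_date", "birth_date"], [])
augmented = build_prompt(
    "dob",
    ["death_date", "birth_date"],
    ["birth_date sits in table person next to gender and ethnicity"],
)
```

Because the shortlist is passed in unchanged, the upstream matcher can be swapped without touching the packing step.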
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ConStruM, a structure-guided add-on framework for LLM-based schema matching. Given an upstream shortlist of candidate targets, it builds a reusable context tree for budgeted multi-level retrieval and a global similarity hypergraph that identifies groups of similar columns on both source and target sides. At query time these structures are used to assemble a small, query-specific context pack containing discriminative evidence (including group-aware differentiation cues) that augments the final LLM prompt. The reported experiments on real datasets show improved matching accuracy, which the paper attributes to supplying and organizing the right contextual evidence without providing the entire schema.
Significance. If the central claim holds, ConStruM would offer a practical, reusable mechanism for managing context budgets in LLM-driven data integration tasks. By separating structure construction from query-time packing, it could reduce prompt length while preserving signal that column names and descriptions alone cannot supply, with potential applicability to large-scale schema reconciliation where full metadata is impractical.
major comments (2)
- [Experiments] Experiments section: the claim that ConStruM improves matching by providing the right contextual evidence is not supported by any quantitative comparison to a full-schema-metadata baseline or to alternative context-selection heuristics; without these controls it is impossible to determine whether observed gains arise from the context tree and hypergraph or simply from prompt formatting.
- [§3.2] §3.2 (global similarity hypergraph): the description of how group-aware differentiation cues are computed (online or precomputed) does not specify the similarity metric, grouping threshold, or any ablation that isolates the hypergraph's contribution from the context tree alone, leaving the sufficiency of the budgeted pack unverified.
minor comments (2)
- [Abstract] The abstract and introduction refer to 'real datasets' without naming them or providing repository links, which hinders reproducibility.
- [§3] Notation for the context tree levels and hypergraph edges is introduced without a consolidated table of symbols, making the framework description harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight valuable opportunities to strengthen the experimental controls and clarify the technical details of the hypergraph. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Experiments] Experiments section: the claim that ConStruM improves matching by providing the right contextual evidence is not supported by any quantitative comparison to a full-schema-metadata baseline or to alternative context-selection heuristics; without these controls it is impossible to determine whether observed gains arise from the context tree and hypergraph or simply from prompt formatting.
Authors: We agree that the current evaluation would be strengthened by explicit controls. The manuscript notes that full-schema metadata is often impractical due to token constraints, which motivates our budgeted approach; however, we acknowledge the need for quantitative isolation of contributions. In the revision we will add: (1) a full-schema baseline truncated to the same token budget as ConStruM, (2) random context selection within the budget, and (3) a simple heuristic baseline (top-k columns by name/description similarity). We will also report token counts and prompt structures across conditions to rule out formatting effects. These additions will appear in an expanded Experiments section with new tables and discussion.
revision: yes
- Referee: [§3.2] §3.2 (global similarity hypergraph): the description of how group-aware differentiation cues are computed (online or precomputed) does not specify the similarity metric, grouping threshold, or any ablation that isolates the hypergraph's contribution from the context tree alone, leaving the sufficiency of the budgeted pack unverified.
Authors: We appreciate this observation and agree the description in §3.2 is insufficiently precise. The similarity metric is cosine similarity over embeddings of concatenated column name and description; groups are formed with a threshold of 0.8; differentiation cues are the per-group difference vectors between the query column and group centroids. These values can be precomputed offline or calculated online. We will revise §3.2 to state these parameters explicitly, include pseudocode, and add an ablation study comparing (context tree only) versus (context tree + hypergraph). The new results will directly verify the hypergraph's incremental value within the budgeted pack.
revision: yes
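The parameters stated in the simulated rebuttal (cosine similarity, a 0.8 threshold, difference vectors to group centroids) can be tried out with the sketch below. The rebuttal does not say how groups are formed from pairwise similarities, so the neighbourhood rule in `similar_groups` is an assumption, and the two-dimensional embeddings are invented toy values.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_groups(emb: Dict[str, Vector], threshold: float = 0.8) -> List[List[str]]:
    """One hyperedge per column: itself plus every column at or above the threshold.

    Assumption: this neighbourhood construction is only one plausible reading
    of "groups are formed with a threshold of 0.8".
    """
    groups: List[List[str]] = []
    for vec in emb.values():
        members = sorted(n for n, v in emb.items() if cosine(vec, v) >= threshold)
        if len(members) > 1 and members not in groups:
            groups.append(members)
    return groups


def centroid(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def differentiation_cue(query: Vector, group: List[str], emb: Dict[str, Vector]) -> List[float]:
    """Difference vector between the query column and the group centroid."""
    c = centroid([emb[n] for n in group])
    return [q - x for q, x in zip(query, c)]


# Toy embeddings (hypothetical): two near-duplicate date columns and one id column.
emb = {
    "start_date": [1.0, 0.0],
    "end_date": [0.9, 0.3],
    "patient_id": [0.0, 1.0],
}

groups = similar_groups(emb)
cue = differentiation_cue(emb["start_date"], groups[0], emb)
```

In this toy example the two date columns fall into one group while `patient_id` stays alone, and the cue vector points from the group centroid toward `start_date`. How such a vector would be turned into text for the LLM prompt is not specified in the source.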
Circularity Check
No circularity in derivation chain
The paper describes a structure-guided framework (context tree + similarity hypergraph) for assembling budgeted context packs for LLM-based schema matching. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on experimental results on real datasets rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is a normal non-finding for a systems/framework paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can leverage natural-language schema metadata for column matching when supplied with appropriate additional context
invented entities (2)
- context tree: no independent evidence
- global similarity hypergraph: no independent evidence