pith. sign in

arxiv: 2601.20482 · v2 · submitted 2026-01-28 · 💻 cs.DB

ConStruM: A Structure-Guided LLM Framework for Context-Aware Schema Matching

Pith reviewed 2026-05-16 10:11 UTC · model grok-4.3

classification 💻 cs.DB
keywords schema matchingLLMcontext treehypergraphdata integrationcontext-awareevidence packing
0
0 comments X

The pith

ConStruM improves schema matching by assembling small context packs from a tree and hypergraph to ground LLM decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Schema matching for data integration often needs evidence beyond single column names and descriptions, but supplying full schemas exceeds practical limits for LLMs. ConStruM builds a reusable lightweight structure with a context tree that supports budgeted retrieval at multiple detail levels and a global similarity hypergraph that identifies groups of similar columns and computes differentiation cues. At query time it packs the most relevant evidence into a compact addition to an upstream matcher's prompt. This matters because correct matches frequently depend on organized supporting details rather than isolated column information.

Core claim

ConStruM constructs a context tree for multi-level budgeted retrieval and a similarity hypergraph to surface similar column groups with group-aware cues, then assembles query-specific context packs that augment LLM prompts to produce better-grounded final matching selections.

What carries the argument

A context tree for budgeted multi-level context retrieval paired with a global similarity hypergraph that surfaces similar column groups and computes differentiation cues to enable compact evidence packing.

Load-bearing premise

A small budgeted context pack drawn from the context tree and hypergraph contains enough discriminative evidence to improve matching decisions.

What would settle it

Measure matching accuracy on the same real datasets with the ConStruM context pack versus an empty or random pack; no accuracy gain would falsify the central claim.

Figures

Figures reproduced from arXiv: 2601.20482 by Houming Chen, H. V. Jagadish, Zhe Zhang.

Figure 1
Figure 1. Figure 1: A context-critical schema-matching example from [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Three approaches to LLM-based schema matching. (a) [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Multi-level context retrieval with a context tree (depth varies by dataset; we show three levels for readability). For a [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Three-stage use of similarity groups (schematic; shown for target candidates). Hyperedges represent confusable [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Column matching is a central task in reconciling schemas for data integration. Column names and descriptions are valuable for this task. LLMs can leverage such natural-language schema metadata. However, in many datasets, correct matching requires additional evidence beyond the column itself. Because it is impractical to provide an LLM with the entire schema metadata needed to capture this evidence, the core challenge becomes to select and organize the most useful contextual information. We present ConStruM, a structure-guided framework for budgeted evidence packing in schema matching. ConStruM constructs a lightweight, reusable structure in which, at query time, it assembles a small context pack emphasizing the most discriminative evidence. ConStruM is designed as an add-on: given a shortlist of candidate targets produced by an upstream matcher, it augments the matcher's final LLM prompt with structured, query-specific evidence so that the final selection is better grounded. For this purpose, we develop a context tree for budgeted multi-level context retrieval and a global similarity hypergraph that surfaces groups of highly similar columns (on both the source and target sides), summarized via group-aware differentiation cues computed online or precomputed offline. Experiments on real datasets show that ConStruM improves matching by providing and organizing the right contextual evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ConStruM, a structure-guided add-on framework for LLM-based schema matching. Given an upstream shortlist of candidate targets, it builds a reusable context tree for budgeted multi-level retrieval and a global similarity hypergraph that identifies groups of similar columns on both source and target sides. At query time these structures are used to assemble a small, query-specific context pack containing discriminative evidence (including group-aware differentiation cues) that augments the final LLM prompt. Experiments on real datasets are reported to show improved matching accuracy by supplying and organizing the right contextual evidence without providing the entire schema.

Significance. If the central claim holds, ConStruM would offer a practical, reusable mechanism for managing context budgets in LLM-driven data integration tasks. By separating structure construction from query-time packing it could reduce prompt length while preserving signal that column names and descriptions alone cannot supply, with potential applicability to large-scale schema reconciliation where full metadata is impractical.

major comments (2)
  1. [Experiments] Experiments section: the claim that ConStruM improves matching by providing the right contextual evidence is not supported by any quantitative comparison to a full-schema-metadata baseline or to alternative context-selection heuristics; without these controls it is impossible to determine whether observed gains arise from the context tree and hypergraph or simply from prompt formatting.
  2. [§3.2] §3.2 (global similarity hypergraph): the description of how group-aware differentiation cues are computed (online or precomputed) does not specify the similarity metric, grouping threshold, or any ablation that isolates the hypergraph's contribution from the context tree alone, leaving the sufficiency of the budgeted pack unverified.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'real datasets' without naming them or providing repository links, which hinders reproducibility.
  2. [§3] Notation for the context tree levels and hypergraph edges is introduced without a consolidated table of symbols, making the framework description harder to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight valuable opportunities to strengthen the experimental controls and clarify the technical details of the hypergraph. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that ConStruM improves matching by providing the right contextual evidence is not supported by any quantitative comparison to a full-schema-metadata baseline or to alternative context-selection heuristics; without these controls it is impossible to determine whether observed gains arise from the context tree and hypergraph or simply from prompt formatting.

    Authors: We agree that the current evaluation would be strengthened by explicit controls. The manuscript notes that full-schema metadata is often impractical due to token constraints, which motivates our budgeted approach; however, we acknowledge the need for quantitative isolation of contributions. In the revision we will add: (1) a full-schema baseline truncated to the same token budget as ConStruM, (2) random context selection within the budget, and (3) a simple heuristic baseline (top-k columns by name/description similarity). We will also report token counts and prompt structures across conditions to rule out formatting effects. These additions will appear in an expanded Experiments section with new tables and discussion. revision: yes

  2. Referee: [§3.2] §3.2 (global similarity hypergraph): the description of how group-aware differentiation cues are computed (online or precomputed) does not specify the similarity metric, grouping threshold, or any ablation that isolates the hypergraph's contribution from the context tree alone, leaving the sufficiency of the budgeted pack unverified.

    Authors: We appreciate this observation and agree the description in §3.2 is insufficiently precise. The similarity metric is cosine similarity over embeddings of concatenated column name and description; groups are formed with a threshold of 0.8; differentiation cues are the per-group difference vectors between the query column and group centroids. These values can be precomputed offline or calculated online. We will revise §3.2 to state these parameters explicitly, include pseudocode, and add an ablation study comparing (context tree only) versus (context tree + hypergraph). The new results will directly verify the hypergraph's incremental value within the budgeted pack. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes a structure-guided framework (context tree + similarity hypergraph) for assembling budgeted context packs for LLM-based schema matching. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on experimental results on real datasets rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is a normal non-finding for a systems/framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the assumption that LLMs benefit from structured, query-specific context beyond raw column metadata and that the proposed lightweight structures can surface it efficiently.

axioms (1)
  • domain assumption LLMs can leverage natural-language schema metadata for column matching when supplied with appropriate additional context
    Stated directly in the abstract as the motivation for the framework
invented entities (2)
  • context tree no independent evidence
    purpose: budgeted multi-level context retrieval
    New structure introduced to organize evidence for LLM prompts
  • global similarity hypergraph no independent evidence
    purpose: surface groups of highly similar columns and compute group-aware differentiation cues
    New structure introduced to capture cross-column relationships

pith-pipeline@v0.9.0 · 5531 in / 1166 out tokens · 60094 ms · 2026-05-16T10:11:07.465610+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2024. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137

  2. [2]

    2011.Schema Matching and Mapping

    Zohra Bellahsene, Angela Bonifati, and Erhard Rahm (Eds.). 2011.Schema Matching and Mapping. Springer. https://doi.org/10.1007/978-3-642-16518-4

  3. [3]

    Philip Bohannon, Eiman Elnahrawy, Wenfei Fan, and Michael Flaster. 2006. Putting Context into Schema Matching. InProceedings of the 32nd International Conference on Very Large Data Bases (VLDB). 307–318. http://dl.acm.org/citation. cfm?id=1164155

  4. [4]

    Hong Hai Do and Erhard Rahm. 2002. COMA - A System for Flexible Combination of Schema Matching Approaches. InVery Large Data Bases Conference. https: //api.semanticscholar.org/CorpusID:9318211

  5. [5]

    AnHai Doan and Alon Y. Halevy. 2005. Semantic Integration Research in the Database Community: A Brief Survey.AI Magazine26, 1 (2005), 83–94. https: //doi.org/10.1609/aimag.v26i1.1801

  6. [6]

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130 [cs.CL] https://arxiv.org/abs/2404.16130

  7. [7]

    Avigdor Gal. 2011. Uncertain schema matching: the power of not knowing. In International Conference on Information and Knowledge Management. https: //api.semanticscholar.org/CorpusID:43482147

  8. [8]

    Osman Erman Gungor, Derak Paulsen, and William Kang. 2025. Schemora: schema matching via multi-stage recommendation and metadata enrichment using off-the-shelf llms.arXiv preprint arXiv:2507.14376(2025)

  9. [9]

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2025. LightRAG: Simple and Fast Retrieval-Augmented Generation. InFindings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Lin- guistics, Suzhou, China, 10746–10761. https://doi.org/10.18653/v1/2025.findings- emnlp.568

  10. [10]

    Mingyu Jeon, Jaeyoung Suh, and Suwan Cho. 2025. Schema Matching on Graph: Iterative Graph Exploration for Efficient and Explainable Data Integration.arXiv preprint arXiv:2511.20285(2025)

  11. [11]

    Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-Wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data3 (2016), 160035. https://doi.org/10.1038/sdata.2016.35

  12. [12]

    Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, and Christoph Lofi

  13. [13]

    InEdbt/icdt workshops

    REMA: Graph Embeddings-based Relational Schema Matching.. InEdbt/icdt workshops. 17

  14. [14]

    Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang Chiew Tan

  15. [15]

    https://api.semanticscholar.org/CorpusID: 214743579

    Deep entity matching with pre-trained language models.Proceedings of the VLDB Endowment14 (2020), 50 – 60. https://api.semanticscholar.org/CorpusID: 214743579

  16. [16]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173

  17. [17]

    Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Hen- rik Johnson, Nimish Amlathe, and Davor Golac. 2024. GRAM: Generative Re- trieval Augmented Matching of Data Schemas in the Context of Data Security. arXiv:2406.01876 [cs.DB] https://arxiv.org/abs/2406.01876

  18. [18]

    Yurong Liu, Eduardo Pena, Aecio Santos, Eden Wu, and Juliana Freire. 2025. Magneto: Combining Small and Large Language Models for Schema Matching. arXiv:2412.08194 [cs.DB] https://arxiv.org/abs/2412.08194

  19. [19]

    Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, and Bálint Molnár. 2025. Knowl- edge graph-based retrieval-augmented generation for schema matching.arXiv preprint arXiv:2501.08686(2025)

  20. [20]

    Yu. A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence42, 4 (2020), 824–

  21. [21]

    https://doi.org/10.1109/TPAMI.2018.2889473

  22. [22]

    Marc Overhage, Patrick B

    J. Marc Overhage, Patrick B. Ryan, Christian G. Reich, Abraham G. Hartzema, and Paul E. Stang. 2012. Validation of a common data model for active safety surveillance research.Journal of the American Medical Informatics Association 19, 1 (2012), 54–60. https://doi.org/10.1136/amiajnl-2011-000376

  23. [23]

    Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M Peeters, and Stijn Vansummeren. 2024. Schema matching with large language models: an experi- mental study.arXiv preprint arXiv:2407.11852(2024)

  24. [24]

    Bernstein

    Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching.The VLDB Journal10, 4 (2001), 334–350. https://doi.org/10. 1007/S007780100057

  25. [25]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. 2024. Raptor: Recursive abstractive processing for tree-organized retrieval. InThe Twelfth International Conference on Learning Representations

  26. [26]

    Nabeel Seedat and Mihaela van der Schaar. 2024. Matchmaker: Self-improving large language model programs for schema matching.arXiv preprint arXiv:2410.24105(2024)

  27. [27]

    Eitam Sheetrit, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2024. Rematch: Retrieval enhanced schema matching with llms.arXiv preprint arXiv:2403.01567(2024)

  28. [28]

    Roee Shraga, Avigdor Gal, and Haggai Roitman. 2020. ADnEV: Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation.Proc. VLDB Endow.13 (2020), 1401–1415. https://api.semanticscholar.org/CorpusID: 214588544

  29. [29]

    University of Michigan. 1992–. Health and Retirement Study (HRS). https: //hrs.isr.umich.edu/. Produced and distributed by the University of Michigan with funding from the National Institute on Aging (grant number NIA U01AG009740)

  30. [30]

    Sha Wang, Yuchen Li, Hanhua Xiao, Bing Tian Dai, Roy Ka-Wei Lee, Yanfei Dong, and Lambert Deng. 2025. LLMATCH: A Unified Schema Matching Framework with Large Language Models. arXiv:2507.10897 [cs.DB] https://arxiv.org/abs/ 2507.10897

  31. [31]

    Yongqin Xu, Huan Li, Ke Chen, and Lidan Shou. 2024. Kcmf: A knowledge- compliant framework for schema and entity matching with fine-tuning-free llms. arXiv preprint arXiv:2410.12480(2024)

  32. [32]

    Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada. 2024. Jel- lyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 8754–8782. 11

  33. [33]

    Jing Zhang, Bonggun Shin, Jinho D Choi, and Joyce C Ho. 2021. SMAT: An attention-based deep learning solution to the automation of schema matching. In European Conference on Advances in Databases and Information Systems. Springer, 260–274

  34. [34]

    Yu Zhang, Mei Di, Haozheng Luo, Chenwei Xu, and Richard Tzong-Han Tsai

  35. [35]

    arXiv preprint arXiv:2402.01685(2024)

    SMUTF: Schema Matching Using Generative Tags and Hybrid Features. arXiv preprint arXiv:2402.01685(2024). 12