ConStruM: A Structure-Guided LLM Framework for Context-Aware Schema Matching

Houming Chen; H. V. Jagadish; Zhe Zhang

arxiv: 2601.20482 · v2 · submitted 2026-01-28 · 💻 cs.DB

Pith reviewed 2026-05-16 10:11 UTC · model grok-4.3

classification 💻 cs.DB

keywords schema matching · LLM · context tree · hypergraph · data integration · context-aware · evidence packing

The pith

ConStruM improves schema matching by assembling small context packs from a tree and hypergraph to ground LLM decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Schema matching for data integration often needs evidence beyond single column names and descriptions, but supplying full schemas exceeds practical limits for LLMs. ConStruM builds a reusable lightweight structure with a context tree that supports budgeted retrieval at multiple detail levels and a global similarity hypergraph that identifies groups of similar columns and computes differentiation cues. At query time it packs the most relevant evidence into a compact addition to an upstream matcher's prompt. This matters because correct matches frequently depend on organized supporting details rather than isolated column information.
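The budgeted, multi-level retrieval described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Node` class, the word-count `token_cost`, the greedy most-specific-first policy, and the example schema are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a hypothetical context tree (e.g. schema -> table -> column)."""
    name: str
    summary: str
    children: List["Node"] = field(default_factory=list)


def token_cost(text: str) -> int:
    # Crude stand-in for a tokenizer: count whitespace-separated words.
    return len(text.split())


def path_to(root: Node, target: str) -> List[Node]:
    """Return the root-to-target path, or an empty list if the target is absent."""
    if root.name == target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub:
            return [root] + sub
    return []


def pack_context(root: Node, target: str, budget: int) -> List[str]:
    """Greedily pack summaries from the most specific level outward."""
    pack, used = [], 0
    for node in reversed(path_to(root, target)):
        cost = token_cost(node.summary)
        if used + cost > budget:
            break  # coarser levels are dropped once the budget is exhausted
        pack.append(f"{node.name}: {node.summary}")
        used += cost
    return pack


# Hypothetical three-level tree, invented for this example.
column = Node("admit_time", "timestamp when the patient was admitted")
table = Node("admissions", "one row per hospital stay", [column])
root = Node("hospital_db", "clinical records for an intensive care unit", [table])

pack = pack_context(root, "admit_time", budget=12)
```

With a budget of 12 words the pack holds the column and table summaries and omits the schema-level summary; a larger budget admits all three levels. A real system would presumably rank evidence by relevance rather than by depth alone.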

Core claim

ConStruM constructs a context tree for multi-level budgeted retrieval and a similarity hypergraph to surface similar column groups with group-aware cues, then assembles query-specific context packs that augment LLM prompts to produce better-grounded final matching selections.

What carries the argument

A context tree for budgeted multi-level context retrieval paired with a global similarity hypergraph that surfaces similar column groups and computes differentiation cues to enable compact evidence packing.

Load-bearing premise

A small budgeted context pack drawn from the context tree and hypergraph contains enough discriminative evidence to improve matching decisions.

What would settle it

Measure matching accuracy on the same real datasets with the ConStruM context pack versus an empty or random pack; no accuracy gain would falsify the central claim.
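A minimal sketch of such a comparison follows. The matcher here is a toy stand-in for an LLM call, the cases and evidence pool are invented, and sizing the random pack to match the constructed one is an assumption; none of it reproduces the paper's experiments or results.

```python
import random


def accuracy(matcher, cases, pack_for):
    """Fraction of cases where the matcher picks the gold target."""
    hits = 0
    for case in cases:
        pack = pack_for(case)
        if matcher(case["source"], case["candidates"], pack) == case["gold"]:
            hits += 1
    return hits / len(cases)


def compare_conditions(matcher, cases, evidence_pool, seed=0):
    """Score the same matcher under empty, random, and constructed packs."""
    rng = random.Random(seed)

    def random_pack(case):
        # Assumption: the random pack has as many lines as the constructed one.
        k = min(len(case["pack"]), len(evidence_pool))
        return rng.sample(evidence_pool, k)

    conditions = {
        "empty": lambda case: [],
        "random": random_pack,
        "constructed": lambda case: case["pack"],
    }
    return {name: accuracy(matcher, cases, fn) for name, fn in conditions.items()}


def toy_matcher(source, candidates, pack):
    # Toy stand-in for the LLM: prefer a candidate named in the pack,
    # otherwise fall back to the first candidate on the shortlist.
    for cand in candidates:
        if any(cand in line for line in pack):
            return cand
    return candidates[0]


# Invented cases; "pack" holds the evidence a context builder would supply.
cases = [
    {"source": "adm_ts", "candidates": ["discharge_time", "admit_time"],
     "gold": "admit_time", "pack": ["adm_ts resembles admit_time"]},
    {"source": "sex", "candidates": ["gender", "age"],
     "gold": "gender", "pack": ["sex resembles gender"]},
]
pool = ["unrelated note A", "unrelated note B", "unrelated note C"]

results = compare_conditions(toy_matcher, cases, pool)
```

The harness only fixes the shape of the test: identical cases and matcher, with the pack as the single varying factor. In a real run `toy_matcher` would be replaced by the upstream matcher's final LLM prompt.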

Figures

Figures reproduced from arXiv: 2601.20482 by Houming Chen, H. V. Jagadish, Zhe Zhang.

**Figure 2.** Three approaches to LLM-based schema matching. (a) …

**Figure 3.** Multi-level context retrieval with a context tree (depth varies by dataset; we show three levels for readability). For a …

**Figure 4.** Three-stage use of similarity groups (schematic; shown for target candidates). Hyperedges represent confusable …

Original abstract

Column matching is a central task in reconciling schemas for data integration. Column names and descriptions are valuable for this task. LLMs can leverage such natural-language schema metadata. However, in many datasets, correct matching requires additional evidence beyond the column itself. Because it is impractical to provide an LLM with the entire schema metadata needed to capture this evidence, the core challenge becomes to select and organize the most useful contextual information. We present ConStruM, a structure-guided framework for budgeted evidence packing in schema matching. ConStruM constructs a lightweight, reusable structure in which, at query time, it assembles a small context pack emphasizing the most discriminative evidence. ConStruM is designed as an add-on: given a shortlist of candidate targets produced by an upstream matcher, it augments the matcher's final LLM prompt with structured, query-specific evidence so that the final selection is better grounded. For this purpose, we develop a context tree for budgeted multi-level context retrieval and a global similarity hypergraph that surfaces groups of highly similar columns (on both the source and target sides), summarized via group-aware differentiation cues computed online or precomputed offline. Experiments on real datasets show that ConStruM improves matching by providing and organizing the right contextual evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConStruM offers a modular way to build budgeted context packs for LLM schema matching via a context tree and similarity hypergraph, but the gains rest on unshown experiments.

ConStruM builds a context tree for multi-level retrieval and a global similarity hypergraph to surface column groups, then packs the most relevant evidence into a small prompt addition for an LLM that is already looking at a shortlist of candidate matches. The core idea is to avoid feeding in the entire schema while still giving the model the relationships and distinctions it needs for accurate column matching. This is presented as an add-on layer rather than a full replacement for existing matchers, which keeps the scope manageable.

The option to compute differentiation cues online or precompute them offline is a practical touch that could help with reuse across queries. The framing around budgeted evidence packing directly tackles a real constraint when schemas get large and LLMs have token limits. The combination of tree-based retrieval and group signals from the hypergraph is the clearest new element compared with simpler metadata-only or full-context baselines. The paper does a solid job of describing how the structures are built and queried without overclaiming theoretical novelty.

The main limitation is that the abstract only asserts improvements on real datasets without showing numbers, baselines, or ablations. It is therefore impossible to tell whether the reported gains come from the tree and hypergraph mechanics or simply from presenting any additional context in a cleaner format. The assumption that the selection heuristic reliably surfaces the decisive cues also needs checking against cases where subtle cross-group signals matter.

This work is aimed at researchers and engineers already using LLMs inside data integration pipelines who want a more systematic way to supply context. A reader focused on practical prompting techniques for structured data tasks would get usable ideas from the framework description. I would send it to peer review because the problem is well-motivated, the proposed structures are concrete, and referees can verify whether the experiments actually support the central claim.

Referee Report

2 major / 2 minor

Summary. The paper presents ConStruM, a structure-guided add-on framework for LLM-based schema matching. Given an upstream shortlist of candidate targets, it builds a reusable context tree for budgeted multi-level retrieval and a global similarity hypergraph that identifies groups of similar columns on both source and target sides. At query time these structures are used to assemble a small, query-specific context pack containing discriminative evidence (including group-aware differentiation cues) that augments the final LLM prompt. Experiments on real datasets are reported to show improved matching accuracy by supplying and organizing the right contextual evidence without providing the entire schema.

Significance. If the central claim holds, ConStruM would offer a practical, reusable mechanism for managing context budgets in LLM-driven data integration tasks. By separating structure construction from query-time packing it could reduce prompt length while preserving signal that column names and descriptions alone cannot supply, with potential applicability to large-scale schema reconciliation where full metadata is impractical.

major comments (2)

[Experiments] Experiments section: the claim that ConStruM improves matching by providing the right contextual evidence is not supported by any quantitative comparison to a full-schema-metadata baseline or to alternative context-selection heuristics; without these controls it is impossible to determine whether observed gains arise from the context tree and hypergraph or simply from prompt formatting.
[§3.2] §3.2 (global similarity hypergraph): the description of how group-aware differentiation cues are computed (online or precomputed) does not specify the similarity metric, grouping threshold, or any ablation that isolates the hypergraph's contribution from the context tree alone, leaving the sufficiency of the budgeted pack unverified.

minor comments (2)

[Abstract] The abstract and introduction refer to 'real datasets' without naming them or providing repository links, which hinders reproducibility.
[§3] Notation for the context tree levels and hypergraph edges is introduced without a consolidated table of symbols, making the framework description harder to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight valuable opportunities to strengthen the experimental controls and clarify the technical details of the hypergraph. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Experiments] Experiments section: the claim that ConStruM improves matching by providing the right contextual evidence is not supported by any quantitative comparison to a full-schema-metadata baseline or to alternative context-selection heuristics; without these controls it is impossible to determine whether observed gains arise from the context tree and hypergraph or simply from prompt formatting.

Authors: We agree that the current evaluation would be strengthened by explicit controls. The manuscript notes that full-schema metadata is often impractical due to token constraints, which motivates our budgeted approach; however, we acknowledge the need for quantitative isolation of contributions. In the revision we will add: (1) a full-schema baseline truncated to the same token budget as ConStruM, (2) random context selection within the budget, and (3) a simple heuristic baseline (top-k columns by name/description similarity). We will also report token counts and prompt structures across conditions to rule out formatting effects. These additions will appear in an expanded Experiments section with new tables and discussion.

revision: yes

Referee: [§3.2] §3.2 (global similarity hypergraph): the description of how group-aware differentiation cues are computed (online or precomputed) does not specify the similarity metric, grouping threshold, or any ablation that isolates the hypergraph's contribution from the context tree alone, leaving the sufficiency of the budgeted pack unverified.

Authors: We appreciate this observation and agree the description in §3.2 is insufficiently precise. The similarity metric is cosine similarity over embeddings of concatenated column name and description; groups are formed with a threshold of 0.8; differentiation cues are the per-group difference vectors between the query column and group centroids. These values can be precomputed offline or calculated online. We will revise §3.2 to state these parameters explicitly, include pseudocode, and add an ablation study comparing (context tree only) versus (context tree + hypergraph). The new results will directly verify the hypergraph's incremental value within the budgeted pack.

revision: yes
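The parameters named in this simulated response (cosine similarity, a 0.8 threshold, centroid difference vectors) can be sketched as follows. This is an illustration of the response as worded, not of the paper's confirmed method: forming groups as connected components via union-find is an assumption, and the three-dimensional vectors stand in for real text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_groups(embeddings, threshold=0.8):
    """Group columns into connected components of the 'cosine >= threshold' relation."""
    names = list(embeddings)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Singleton groups carry no confusability signal, so they are dropped.
    return [sorted(g) for g in groups.values() if len(g) > 1]


def centroid(vectors):
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]


def differentiation_cue(query_vec, group, embeddings):
    """Difference vector between a query column and the centroid of a group."""
    c = centroid([embeddings[n] for n in group])
    return [q - x for q, x in zip(query_vec, c)]


# Toy embeddings, invented for this example.
emb = {
    "admit_time": [1.0, 0.0, 0.1],
    "discharge_time": [0.9, 0.1, 0.1],
    "patient_age": [0.0, 1.0, 0.0],
}

groups = similarity_groups(emb, threshold=0.8)
cue = differentiation_cue(emb["admit_time"], groups[0], emb)
```

Here the two timestamp columns fall into one group while `patient_age` stays outside it. A raw difference vector is not directly readable by an LLM, so a practical pipeline would still need a step that turns it into a textual cue; that step is left open here.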

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a structure-guided framework (context tree + similarity hypergraph) for assembling budgeted context packs for LLM-based schema matching. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on experimental results on real datasets rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is a normal non-finding for a systems/framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that LLMs benefit from structured, query-specific context beyond raw column metadata and that the proposed lightweight structures can surface it efficiently.

axioms (1)

domain assumption: LLMs can leverage natural-language schema metadata for column matching when supplied with appropriate additional context. Stated directly in the abstract as the motivation for the framework.

invented entities (2)

context tree (no independent evidence)
purpose: budgeted multi-level context retrieval
New structure introduced to organize evidence for LLM prompts.

global similarity hypergraph (no independent evidence)
purpose: surface groups of highly similar columns and compute group-aware differentiation cues
New structure introduced to capture cross-column relationships.

pith-pipeline@v0.9.0 · 5531 in / 1166 out tokens · 60094 ms · 2026-05-16T10:11:07.465610+00:00 · methodology

