Conceptual Schema Inference for Tabular Datasets using Large Language Models

Jiaoyan Chen; Norman W. Paton; Zhenyu Wu

arxiv: 2605.23105 · v2 · pith:UXHA3VEZnew · submitted 2026-05-21 · 💻 cs.DB

Pith reviewed 2026-05-25 04:44 UTC · model grok-4.3

classification 💻 cs.DB

keywords: conceptual schema inference, tabular datasets, large language models, data lakes, entity types, schema inference, GeSI, EmSI

The pith

Two LLM methods infer conceptual schemas of entity types, attributes and relationships from raw tables using only headers and cell values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tabular collections from data lakes and open portals often contain inconsistent representations that hinder organization. The paper proposes two approaches to derive conceptual schemas automatically from raw tables. GeSI applies generative large language models to extract hierarchical types and attributes, then assembles a global schema that includes cross-type relationships. EmSI instead creates embeddings of tables to group them by semantics, infers attributes per group, and builds hierarchies from shared attribute patterns. The evaluation reports that the resulting schemas are concise, structurally sound, and scale to large repositories, and a case study illustrates end-to-end applicability.

Core claim

GeSI and EmSI are two large-language-model techniques that derive conceptual schemas directly from raw tables by inferring hierarchical entity types and attributes from column headers and cell values, with GeSI using generative models to integrate types into a global schema that also records relationships and EmSI using embeddings to group tables and construct hierarchies from shared attribute patterns.

What carries the argument

GeSI and EmSI: the generative-LLM and embedding-based LLM methods that turn table-level and column-level semantics into hierarchical entity types, attributes and inter-type relationships.
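To make the target of both methods concrete, here is a minimal sketch of a data structure that could hold an inferred conceptual schema. It is an illustration only, not the authors' representation: the class names, the `cardinality` vocabulary, the attribute-inheritance rule, and the example types (`CreativeWork`, `MusicRecording`, `Person`) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class EntityType:
    name: str
    parent: Optional[str] = None  # None marks a top-level type
    attributes: Set[str] = field(default_factory=set)


@dataclass
class Relationship:
    source: str  # entity type that holds the referencing attribute
    target: str  # entity type being referenced
    name: str
    cardinality: str = "unknown"  # placeholder vocabulary, e.g. "many-to-one"


@dataclass
class ConceptualSchema:
    types: Dict[str, EntityType] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_type(self, name, parent=None, attributes=()):
        self.types[name] = EntityType(name, parent, set(attributes))

    def type_path(self, name):
        """Return type names from the top-level type down to `name`."""
        path = []
        current = name
        while current is not None:
            path.append(current)
            current = self.types[current].parent
        return path[::-1]

    def all_attributes(self, name):
        """Attributes of `name` plus those of its ancestors (assumed rule)."""
        attrs = set()
        for type_name in self.type_path(name):
            attrs |= self.types[type_name].attributes
        return attrs


# Hypothetical example schema with one hierarchy and one relationship.
schema = ConceptualSchema()
schema.add_type("CreativeWork", attributes={"title"})
schema.add_type("MusicRecording", parent="CreativeWork", attributes={"composer"})
schema.add_type("Person", attributes={"name"})
schema.relationships.append(
    Relationship("MusicRecording", "Person", "composedBy", "many-to-one")
)
print(schema.type_path("MusicRecording"))
```

Storing the parent by name keeps the hierarchy easy to serialize; a real implementation would also need to guard against cycles in the parent links.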

If this is right

Conceptual schemas can be produced scalably for large heterogeneous repositories.
Inferred schemas exhibit measurable conciseness and structural quality.
Relationships across entity types are captured without manual intervention.
End-to-end schema inference becomes feasible for web tables and open data portals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The inferred schemas could serve as metadata that improves search and integration across data lakes.
The grouping and generation steps might be combined with existing dataset-discovery tools to reduce manual curation.
Accuracy of extracted relationships could rise further if the same models receive modest additional context from related tables.
The same column-semantics approach may extend to other semi-structured formats that contain attribute-like fields.

Load-bearing premise

Large language models can reliably extract accurate hierarchical entity types, attributes and inter-type relationships solely from column headers and cell values without domain-specific fine-tuning or additional context.

What would settle it

A benchmark collection of tables supplied with expert ground-truth schemas on which either method produces entity-type hierarchies or relationships that show low overlap with the ground truth.
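One way such an overlap check might be scored, assuming both the inferred and the expert hierarchy are expressed as sets of (subtype, supertype) edges, is sketched below. The edge representation, the function name `edge_overlap`, and the toy type names are assumptions for illustration; the paper does not prescribe this metric.

```python
def edge_overlap(inferred_edges, gold_edges):
    """Precision, recall and F1 of inferred (subtype, supertype) edges."""
    inferred, gold = set(inferred_edges), set(gold_edges)
    common = inferred & gold
    precision = len(common) / len(inferred) if inferred else 0.0
    recall = len(common) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical expert hierarchy and a hypothetical inferred one.
gold = {("Person", "Thing"), ("Athlete", "Person"), ("City", "Place")}
inferred = {
    ("Person", "Thing"),
    ("Athlete", "Person"),
    ("City", "Thing"),
    ("Place", "Thing"),
}

precision, recall, f1 = edge_overlap(inferred, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact edge matching is strict: an inferred type that differs from the expert label only in name counts as a miss, so a real benchmark would likely need a type-alignment step first.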

Figures

Figures reproduced from arXiv: 2605.23105 by Jiaoyan Chen, Norman W. Paton, Zhenyu Wu.

**Figure 2.** The overall framework of GeSI. GeSI constructs a schema through local-to-global steps, consolidating intermediate outputs for reuse in later stages.

**Figure 4.** Prompt constraints for per-table type path in …

**Figure 5.** Prompt for inferring column type paths. Prompt excerpt: "The following presents a column along with its column type, surrounding table, and the table's most specific entity type. Your task is to infer an appropriate attribute name for the given column by considering its header, sampled values, column type, contextual information, and the table's most specific entity type. [INFERENCE CRITERIA] <EXAMPLES> Header: composer; Co…"

**Figure 8.** Prompt for selecting top-K candidate top-level types in relationship inference. Prompt excerpt: "You are given an attribute of an entity type and a list of candidate entity types. The attribute includes a name and a set of sample values. Your task is to: based on the attribute name and the sample values, select the most specific entity type that accurately describes all of the sample values. Each sample value should be a p…"

**Figure 12.** Demonstration of EmSI with the input and …

**Figure 13.** A dendrogram example with the top-level type …

**Figure 16.** PTCS of the hierarchies with and without the LLM-based verification step on all datasets.

| Method | WDC RI | WDC Purity | GDS RI | GDS Purity | OD Small RI | OD Small Purity |
|---|---|---|---|---|---|---|
| GeSI (Llama-3.1-8B) | 0.978 | 0.892 | 0.993 | 0.842 | 0.992 | 0.867 |
| GeSI (Qwen2.5-14B) | 0.987 | 0.927 | 0.995 | 0.894 | 0.993 | 0.906 |
| GeSI (GPT-3.5) | 0.981 | 0.923 | 0.997 | 0.928 | 0.995 | 0.916 |
| GeSI (GPT-4) | 0.989 | 0.931 | 0.993 | 0.933 | 0.994 | 0.923 |
| EmSI (SBERT) | 0.933 | 0.686 | 0.991 | 0.771 | 0.994 | 0.787 |
| EmS… | | | | | | |

**Figure 17.** Overall runtime of GeSI (Qwen2.5-14B/GPT-3.5) and EmSI (SwAV) on OD Large. Axes: #Tables (x), Time in minutes (y). Series: Type and Type Hierarchy Inference, Attribute Inference, Relationship Inference, Total Time.

**Figure 20.** Comparison between GeSI and EmSI in top-level types and sub-hierarchy inference.
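The results captured with Figure 16 report RI and Purity per dataset. Assuming RI denotes the Rand index and Purity the standard clustering purity (the excerpt does not define either), the sketch below shows how both could be computed when tables are grouped by inferred type and compared with gold type labels. The function names and the toy labels are illustrative.

```python
from collections import Counter
from itertools import combinations


def rand_index(pred, gold):
    """Share of item pairs on which the two groupings agree."""
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (pred[i] == pred[j]) == (gold[i] == gold[j]) for i, j in pairs
    )
    return agree / len(pairs)


def purity(pred, gold):
    """Share of items that carry the majority gold label of their cluster."""
    if not pred:
        return 0.0
    by_cluster = {}
    for cluster, label in zip(pred, gold):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(pred)


# Five tables: predicted cluster ids versus hypothetical gold type labels.
pred = [0, 0, 1, 1, 1]
gold = ["a", "a", "a", "b", "b"]

print(rand_index(pred, gold))  # 0.6
print(purity(pred, gold))      # 0.8
```

The pairwise loop is quadratic in the number of tables, which is fine for a sketch but would call for a contingency-table formulation on large repositories.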

Original abstract

Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to representational inconsistencies. Understanding and organizing such repositories therefore remains a major challenge. While prior work has primarily focused on dataset discovery and exploration, this paper addresses the complementary problem of conceptual schema inference: automatically deriving a conceptual schema that captures entity types, attributes and inter-type relationships directly from raw tables. We propose two large language model (LLM)-based approaches that use only column headers and cell values: GeSI uses generative LLMs to infer hierarchical types and their attributes from table- and column-level semantics, and to integrate them into a global schema that also captures relationships across types; EmSI employs LLM-based table embeddings to group tables by column-level semantics, infer attributes within each group, and construct hierarchical structures from shared attribute patterns. Finally, we report an experimental analysis demonstrating the effectiveness of our approaches in terms of the conciseness and structural quality of the inferred schema components, their scalability to large repositories, and a case study illustrating end-to-end schema inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Two concrete LLM techniques for conceptual schema inference from tables are new here, but evaluation details are thin and no robustness steps against hallucinations are described.

The main takeaway is that the paper introduces GeSI and EmSI as two distinct LLM-based methods for deriving conceptual schemas (entity types, attributes, and relationships) from raw table headers and values. This is a focused step past standard dataset discovery work in data lakes and open data settings. GeSI relies on generative prompting for hierarchies and cross-type links, while EmSI uses embeddings for table grouping followed by shared-attribute hierarchy construction. Both stay within the constraint of no extra domain context or fine-tuning. That combination is the actual new element.

The paper does a reasonable job framing the practical problem of inconsistent tabular repositories and sketching approaches that target concise, structured output plus scalability to large collections, with a case study for end-to-end flow. Those choices show attention to real usage constraints.

The soft spots are clear and proportionate. The abstract claims an experimental analysis on conciseness, structural quality, and scalability, yet supplies no metrics, baselines, or error breakdowns, so it is impossible to judge whether the effectiveness holds. The stress-test concern lands: nothing in the description indicates automated validation, prompt voting, or post-processing to catch invented subtype relations or misaligned attributes. LLMs routinely produce plausible but non-canonical structures, and without those safeguards the reported quality could be prompt-dependent rather than reliable.

This is aimed at database researchers working on schema management and data integration. Someone already exploring LLM applications in that subfield would get usable method outlines and the case study, even if they would need to add their own accuracy checks. The work is coherent on its own terms and deserves a serious referee to pressure-test the experiments and suggest fixes for the robustness gap.

Referee Report

3 major / 2 minor

Summary. The paper proposes two LLM-based methods for conceptual schema inference from heterogeneous tabular data in data lakes: GeSI, which uses generative prompting to derive hierarchical entity types, attributes, and cross-type relationships from column headers and cell values, and EmSI, which uses LLM embeddings to group tables by semantics, infer attributes, and build hierarchies from shared patterns. The central claim is that these zero-shot approaches produce concise, high-quality schemas scalably, supported by experimental analysis and a case study.

Significance. If the results hold with proper validation, the work addresses a practically important gap in data integration by automating conceptual schema extraction without domain ontologies or labeled data. Strengths include the dual-method design (generative vs. embedding) and focus on scalability to large repositories. However, the zero-shot reliance on LLMs without explicit consistency mechanisms limits immediate impact until robustness is demonstrated.

major comments (3)

[Abstract / Methods] Abstract and Methods (inferred §3): The description of both GeSI and EmSI states that schemas are inferred 'directly from raw tables' using only headers and cell values, but provides no automated validation, majority voting, or post-processing to enforce invariants on hierarchies or relationships. This is load-bearing for the accuracy claim, as LLMs are known to produce inconsistent or spurious structures without such checks.
[Experimental analysis] Experimental analysis (inferred §4): The abstract claims effectiveness on 'conciseness and structural quality' and scalability, but no specific metrics (e.g., precision/recall against ground truth, inter-annotator agreement, or comparison to baselines like rule-based or supervised schema inference), error analysis, or dataset details are visible. This undermines assessment of whether reported quality reflects reliable extraction or prompt artifacts.
[Case study] Case study (inferred §5): The end-to-end illustration is presented as supporting evidence, but without quantitative comparison to manual schemas or discussion of failure modes (e.g., misaligned attributes across tables), it does not sufficiently address the central assumption of reliable extraction.
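The first major comment names majority voting as one missing safeguard. A minimal sketch of that idea follows, assuming the same prompt is sampled several times and each sample is a type path from top-level type to most specific type. The sampled paths are hard-coded here, since no model call is shown; the function name and the `min_agreement` threshold are assumptions, not part of GeSI.

```python
from collections import Counter


def majority_vote(samples, min_agreement=0.5):
    """Return (winning path or None, vote share of the most common path).

    A path wins only if strictly more than `min_agreement` of the
    samples agree on it; otherwise the result is flagged as None.
    """
    if not samples:
        return None, 0.0
    counts = Counter(tuple(path) for path in samples)
    winner, votes = counts.most_common(1)[0]
    share = votes / len(samples)
    return (list(winner) if share > min_agreement else None), share


# Five hypothetical samples for the same table.
samples = [
    ["Thing", "Person", "Athlete"],
    ["Thing", "Person", "Athlete"],
    ["Thing", "Person"],
    ["Thing", "Person", "Athlete"],
    ["Thing", "Organization"],
]

path, share = majority_vote(samples)
print(path, share)
```

Returning `None` below the threshold lets a pipeline route low-agreement tables to a stricter check instead of silently accepting an unstable answer.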

minor comments (2)

[Introduction] Clarify notation for 'hierarchical types' and 'shared attribute patterns' early in the paper to avoid ambiguity between GeSI and EmSI outputs.
[Related Work] Add references to prior LLM hallucination mitigation techniques in schema-related tasks (e.g., from NLP or data integration literature) to contextualize the approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating planned revisions to strengthen the presentation of our zero-shot LLM-based methods for conceptual schema inference.

Referee: [Abstract / Methods] Abstract and Methods (inferred §3): The description of both GeSI and EmSI states that schemas are inferred 'directly from raw tables' using only headers and cell values, but provides no automated validation, majority voting, or post-processing to enforce invariants on hierarchies or relationships. This is load-bearing for the accuracy claim, as LLMs are known to produce inconsistent or spurious structures without such checks.

Authors: We acknowledge that our zero-shot design, which relies directly on LLM inference from headers and cell values without explicit consistency enforcement, leaves open the possibility of inconsistent hierarchies or relationships. This is a valid concern given known LLM behaviors. In the revised manuscript we will expand the Methods section to discuss this limitation explicitly and introduce lightweight post-processing (e.g., attribute-overlap-based deduplication of entity types and simple relationship validation) while preserving the zero-shot character of the core approaches. revision: yes
Referee: [Experimental analysis] Experimental analysis (inferred §4): The abstract claims effectiveness on 'conciseness and structural quality' and scalability, but no specific metrics (e.g., precision/recall against ground truth, inter-annotator agreement, or comparison to baselines like rule-based or supervised schema inference), error analysis, or dataset details are visible. This undermines assessment of whether reported quality reflects reliable extraction or prompt artifacts.

Authors: Section 4 reports dataset details (heterogeneous sources from data lakes and open portals) together with task-specific metrics: conciseness via counts of unique entity types/attributes, structural quality via hierarchy depth and cross-type relationship coverage, and scalability via runtime scaling experiments. Ground-truth schemas are rarely available at scale for such repositories, which is why we did not report precision/recall. We agree that adding an error analysis, inter-annotator agreement on a sampled subset, and comparisons against simple rule-based baselines would improve transparency; these will be included in the revision. revision: yes
Referee: [Case study] Case study (inferred §5): The end-to-end illustration is presented as supporting evidence, but without quantitative comparison to manual schemas or discussion of failure modes (e.g., misaligned attributes across tables), it does not sufficiently address the central assumption of reliable extraction.

Authors: The case study is intended as a qualitative demonstration of the full pipeline. We agree that quantitative comparison against manually constructed schemas on a limited scale and explicit discussion of observed failure modes (such as attribute misalignment) would provide stronger support. The revised §5 will incorporate these elements. revision: yes
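The authors' first response mentions attribute-overlap-based deduplication of entity types as planned post-processing. The rebuttal gives no algorithm, so the following is only one plausible reading: a greedy merge of types whose attribute sets have high Jaccard similarity. The threshold of 0.75, the function names, and the example types are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar_types(types, threshold=0.75):
    """Greedily merge entity types with overlapping attribute sets.

    `types` maps a type name to its set of attribute names. Returns the
    merged types and a map from each original name to its representative.
    """
    merged = {}  # representative name -> combined attribute set
    alias = {}   # original name -> representative name
    for name in sorted(types):  # sorted for a deterministic outcome
        attrs = set(types[name])
        for rep, rep_attrs in merged.items():
            if jaccard(attrs, rep_attrs) >= threshold:
                rep_attrs |= attrs  # fold attributes into the representative
                alias[name] = rep
                break
        else:
            merged[name] = attrs
            alias[name] = name
    return merged, alias


# Hypothetical inferred types; "Film" and "Movie" overlap on 4 of 5 attributes.
types = {
    "Film": {"title", "director", "year", "runtime"},
    "Movie": {"title", "director", "year", "runtime", "genre"},
    "City": {"name", "population", "country"},
}

merged, alias = merge_similar_types(types)
print(sorted(merged), alias["Movie"])
```

Greedy merging depends on iteration order and on the threshold, so results on real schemas would need to be checked against the hierarchy rather than trusted as is.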

Circularity Check

0 steps flagged

No circularity: empirical LLM method proposal with independent evaluation

The paper proposes two LLM-based methods (GeSI and EmSI) for conceptual schema inference from tabular data using only headers and cell values. No equations, fitted parameters, derivations, or predictions are described that could reduce to inputs by construction. Claims rest on experimental analysis of conciseness, structural quality, and scalability, which is external validation rather than self-referential. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the abstract or method outline. The work is self-contained as an empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can perform semantic inference from raw table content; no free parameters or invented entities are mentioned.

axioms (1)

Domain assumption: Large language models can accurately capture entity types, attributes, and relationships from column headers and cell values alone.
Both GeSI and EmSI are built directly on this premise as described in the abstract.

pith-pipeline@v0.9.0 · 5717 in / 1120 out tokens · 30985 ms · 2026-05-25T04:44:34.878575+00:00 · methodology
