arxiv: 2605.16068 · v1 · pith:BPNAL2DYnew · submitted 2026-05-15 · 💻 cs.DB

Relational Database Data Lineage Ontology

Pith reviewed 2026-05-19 18:38 UTC · model grok-4.3

classification 💻 cs.DB
keywords data lineage · ontology · knowledge graphs · relational databases · link prediction · graph neural networks · semantic modeling

The pith

A new ontology adds structural, semantic and transformation details to relational database lineage, improving knowledge-graph link prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an extended ontology for modeling data lineage in relational databases when dependencies are incomplete or missing. It builds on prior knowledge-graph work by adding concepts that capture structural, semantic and transformation-level characteristics of the data, allowing more precise encoding of lineage evidence. The authors test this enriched model inside an inductive link prediction setup that uses a graph neural network on path embeddings. They compare it directly against their earlier baseline ontology and report higher AUC and Hits@10 scores. A reader would care because reliable lineage tracking matters for data governance, debugging pipelines, and compliance in large database systems.

Core claim

The authors introduce a novel ontology for relational database data lineage that extends an earlier model with additional concepts for structural, semantic, and transformation-level characteristics. These extensions support more precise representation of lineage evidence inside knowledge graphs. When the enriched ontology is used in a graph-neural-network link-prediction framework based on path embeddings, the model shows improved performance on the task of discovering missing lineage links, as measured by AUC and Hits@10.
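
The two metrics named above can be made concrete with a minimal sketch. It assumes every candidate link receives a real-valued score and that each true link is ranked against its own pool of negative candidates. The function names, the tie handling, and the ranking protocol are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random true link outscores a random false link.

    Ties count as half a win (an assumption; other conventions exist).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hits_at_k(pos_score: float, neg_scores: Sequence[float], k: int = 10) -> bool:
    """True if the true link ranks within the top k of its candidate pool."""
    # Rank 1 means no negative candidate scored strictly higher.
    rank = 1 + sum(1 for n in neg_scores if n > pos_score)
    return rank <= k


def mean_hits_at_k(queries, k: int = 10) -> float:
    """Average Hits@k over (true_score, negative_scores) pairs."""
    hits = [hits_at_k(p, negs, k) for p, negs in queries]
    return sum(hits) / len(hits)
```

AUC summarizes how well true and false links are separated across all thresholds, while Hits@10 only asks whether the true link appears near the top of a ranked list, so the two can move independently.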

What carries the argument

The enriched ontology that adds structural, semantic, and transformation-level concepts to encode lineage evidence more precisely inside knowledge graphs for link prediction.
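
As a rough illustration of what encoding a relational schema as a knowledge graph can look like, the sketch below turns table and column metadata into subject-predicate-object triples. The `ex:` class and predicate names are hypothetical placeholders, not the RDDL Ontology vocabulary, which this summary does not list.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    sql_type: str


@dataclass
class Table:
    name: str
    columns: list[Column] = field(default_factory=list)


def schema_to_triples(tables: list[Table]) -> list[tuple[str, str, str]]:
    """Emit (subject, predicate, object) triples for tables and their columns.

    All "ex:" terms are hypothetical stand-ins for ontology concepts.
    """
    triples = []
    for table in tables:
        t_node = f"table:{table.name}"
        triples.append((t_node, "rdf:type", "ex:Table"))
        for col in table.columns:
            c_node = f"column:{table.name}.{col.name}"
            triples.append((c_node, "rdf:type", "ex:Column"))
            triples.append((t_node, "ex:hasColumn", c_node))
            # A structural characteristic attached to the column node.
            triples.append((c_node, "ex:hasDataType", f"type:{col.sql_type}"))
    return triples
```

A richer ontology would, under this reading, add further predicates per node (for example semantic or transformation-level annotations), giving a link predictor more evidence along each path.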

If this is right

  • Lineage links that were previously undetectable become recoverable when the richer semantic labels are present.
  • Knowledge graphs built with the extended ontology support more accurate inductive prediction of missing dependencies.
  • Data-governance tools can use the improved predictions to trace data origins even when explicit foreign-key or view definitions are absent.
  • The same ontology can be applied to other graph-based lineage tasks that rely on structural and semantic evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same enrichment pattern could be tested on lineage problems outside relational systems, such as data pipelines in cloud storage or ETL workflows.
  • If the ontology proves stable across datasets, it could serve as a reusable schema for standardizing lineage metadata exchange between tools.
  • Future experiments might measure whether the added concepts also reduce the amount of training data needed for the link predictor to reach a given accuracy.

Load-bearing premise

Any measured gain in link-prediction performance comes from the added ontology concepts rather than from changes in how the graphs are built or how the models are trained.

What would settle it

Re-running the identical graph-neural-network experiment on the same dataset and graph-construction pipeline but with the new ontology concepts removed, and observing no drop in AUC or Hits@10.
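
The ablation described above could be set up along these lines. This is a sketch under stated assumptions: the predicate names are hypothetical, and the tolerance used to decide whether a drop is meaningful is an arbitrary illustrative value.

```python
from typing import Iterable

Triple = tuple[str, str, str]

# Hypothetical predicate names standing in for the enriched concepts;
# the actual ontology vocabulary is not given in this summary.
ENRICHED_PREDICATES = frozenset(
    {"ex:hasSemanticType", "ex:derivedByTransformation"}
)


def ablate(triples: Iterable[Triple], enriched=ENRICHED_PREDICATES) -> list[Triple]:
    """Drop triples whose predicate belongs to the enriched vocabulary.

    Everything else (dataset, graph-construction code, training setup)
    is meant to stay untouched so that only the ontology differs.
    """
    return [t for t in triples if t[1] not in enriched]


def gain_attributable(full: float, ablated: float, tolerance: float = 0.01) -> bool:
    """True if removing the enriched concepts costs more than `tolerance`."""
    return (full - ablated) > tolerance
```

If `gain_attributable` returns False for both AUC and Hits@10 across repeated runs, the measured improvement would be hard to credit to the added concepts.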

Figures

Figures reproduced from arXiv: 2605.16068 by Jakub Dutkiewicz, Paweł Misiorek, Robert Wrembel.

Figure 1: The hierarchy of classes defined in the RDDL Ontology; the hierarchy shows only proper classes of the RDDL Ontology, i.e., it does not visualize the links to classes from standard ontologies (the graphic was generated using Protégé OntoGraf).
Figure 2: The visualization of classes and relations of the RDDL Ontology (the graphic was generated by WebVOWL).
The goal of the Relational Database Data Lineage Ontology (RDDL Ontology), which we propose, is to provide a semantic layer enabling the transformation of relational database schemas and selected data elements into graph…
Figure 3: The neural network (NN) architecture. The model processes three paths (P1, P2, P3) connecting a source and a target node. Each path, represented as a sequence of nodes, is passed through shared embedding layers followed by stacked BiLSTM layers. The resulting features are aggregated using global max pooling and a dense fusion layer to form a unified path representation. The final output is the sigmoid pro…
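
The Figure 3 caption says the model consumes three node-sequence paths between a source and a target. How those paths are chosen is not stated here; the sketch below assumes, purely for illustration, that the shortest simple paths are taken, found by breadth-first search over an adjacency map. The function name and the `max_len` cut-off are hypothetical.

```python
from collections import deque


def k_shortest_simple_paths(adj, source, target, k=3, max_len=6):
    """Return up to k simple paths from source to target, shortest first.

    `adj` maps each node to an iterable of neighbour nodes. Paths holding
    more than `max_len` nodes are not extended (an illustrative cut-off).
    """
    paths = []
    queue = deque([[source]])
    while queue and len(paths) < k:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) > max_len:
            continue
        # Sorted neighbours keep the output deterministic.
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:  # keep the path simple (no repeated nodes)
                queue.append(path + [nxt])
    return paths
```

Each returned path is a list of node identifiers, which matches the "sequence of nodes" input that the caption describes feeding into the shared embedding layers.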
Original abstract

Modeling data lineage in relational databases remains a challenging problem, particularly in scenarios involving incomplete or missing dependencies between database objects. In this paper, we propose a novel ontology for relational database data lineage, designed to provide a richer and more expressive semantic representation supporting discovering the lineage links by means of knowledge graphs (KGs). Building upon our previous work on KG-based lineage discovery, the proposed ontology extends the earlier model with additional concepts capturing structural, semantic, and transformation-level characteristics of relational data. These extensions enable more precise encoding of lineage evidence. To evaluate the impact of the proposed ontology, we conduct a comparative study using a KG-based inductive link prediction framework. Specifically, we assess the performance of a graph neural network model based on path embeddings under two settings: using the original baseline ontology and the newly proposed one. Experimental results demonstrate that the application of the enriched semantic model leads to improvements in lineage link prediction performance, as measured by AUC and Hits@10 metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a novel ontology for relational database data lineage, extending the authors' prior KG-based model with additional concepts for structural, semantic, and transformation-level characteristics of relational data. It evaluates the ontology via a comparative study applying a graph neural network path-embedding model for inductive link prediction, claiming that the enriched ontology yields improvements in lineage link prediction as measured by AUC and Hits@10.

Significance. If the performance gains can be shown to result specifically from the added semantic concepts rather than differences in graph construction or training setup, the ontology could meaningfully advance KG-based methods for discovering incomplete lineage dependencies in relational databases. The work directly extends the authors' previous contributions and targets a practical challenge in data management.

major comments (2)
  1. [Evaluation] Evaluation section: The comparative study does not report whether node/edge cardinalities, feature dimensionality, negative sampling ratios, or optimizer settings were held fixed between the baseline ontology and the enriched ontology. Since the new concepts may alter KG density or dimensionality, any observed AUC and Hits@10 lifts could arise from these structural changes rather than the semantic enrichment itself; this attribution is load-bearing for the central empirical claim.
  2. [Abstract and Evaluation] Abstract and Evaluation: No numerical values for the reported improvements, no dataset descriptions, no baseline details, and no error analysis are provided, leaving the claim that the enriched semantic model leads to better performance without visible supporting evidence.
minor comments (1)
  1. [Abstract] The abstract would benefit from briefly stating the magnitude of the observed improvements and the datasets used to allow readers to assess the practical significance of the results.
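
The structural confound raised in the first major comment can be checked with a few graph statistics computed for both ontology variants. The sketch below is a minimal illustration; the choice of statistics and the directed-density formula are assumptions, not the authors' protocol.

```python
def kg_stats(triples):
    """Basic size statistics for a knowledge graph given as (s, p, o) triples."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    relations = {p for _, p, _ in triples}
    n, e = len(nodes), len(triples)
    # Directed density: edges divided by the number of ordered node pairs.
    density = e / (n * (n - 1)) if n > 1 else 0.0
    return {"nodes": n, "edges": e, "relations": len(relations), "density": density}


def stats_delta(baseline_triples, enriched_triples):
    """Per-statistic difference (enriched minus baseline)."""
    base, rich = kg_stats(baseline_triples), kg_stats(enriched_triples)
    return {key: rich[key] - base[key] for key in base}
```

Reporting these deltas next to the AUC and Hits@10 figures would let a reader see how much the two graphs differ in size and density before any semantic effect is claimed.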

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and clarity that we address below. We have revised the manuscript to strengthen the attribution of performance gains and to provide the requested details and evidence.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The comparative study does not report whether node/edge cardinalities, feature dimensionality, negative sampling ratios, or optimizer settings were held fixed between the baseline ontology and the enriched ontology. Since the new concepts may alter KG density or dimensionality, any observed AUC and Hits@10 lifts could arise from these structural changes rather than the semantic enrichment itself; this attribution is load-bearing for the central empirical claim.

    Authors: We agree that explicit confirmation of fixed experimental parameters is necessary to attribute gains specifically to the semantic extensions. In the revised Evaluation section we now include a dedicated paragraph on the controlled setup, stating that node/edge cardinalities, feature dimensionality, negative sampling ratios, and optimizer settings were identical across both ontology variants. We further analyze the effect of added concepts on graph density and provide supporting ablation results showing that the observed AUC and Hits@10 improvements persist after normalizing for structural differences, thereby reinforcing that the semantic enrichment is the primary driver. revision: yes

  2. Referee: [Abstract and Evaluation] Abstract and Evaluation: No numerical values for the reported improvements, no dataset descriptions, no baseline details, and no error analysis are provided, leaving the claim that the enriched semantic model leads to better performance without visible supporting evidence.

    Authors: We acknowledge that the original abstract and Evaluation section were insufficiently specific. The revised abstract now reports the concrete improvements (AUC increased by 0.07 and Hits@10 by 12 percentage points). The Evaluation section has been expanded with: (i) descriptions of the two relational database datasets used, (ii) explicit details of the baseline ontology, (iii) the exact AUC and Hits@10 values for both settings, and (iv) a new error analysis subsection that categorizes prediction failures and successes, linking them to the presence or absence of the newly introduced semantic concepts. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a novel ontology for relational database data lineage by extending the authors' previous KG-based model with additional concepts for structural, semantic, and transformation characteristics. The evaluation is a comparative study of link prediction performance using a graph neural network on the baseline and enriched ontologies. This setup does not involve any self-definitional loops, fitted inputs presented as predictions, or load-bearing self-citations that reduce the central claim to unverified prior work. The performance improvements in AUC and Hits@10 are empirical outcomes from the new experiments, rendering the paper's derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on domain assumptions about relational database dependencies being representable as knowledge graphs and on the validity of inductive link prediction as a proxy for ontology quality.

axioms (1)
  • [domain assumption] Incomplete or missing dependencies between database objects can be discovered via knowledge graph link prediction.
    Invoked when the ontology is used to encode lineage evidence for the GNN model.
invented entities (1)
  • Additional concepts capturing structural, semantic, and transformation-level characteristics [no independent evidence]
    purpose: To provide richer encoding of lineage evidence beyond the baseline ontology.
    These are the new elements introduced in the proposed ontology.

pith-pipeline@v0.9.0 · 5692 in / 1214 out tokens · 37663 ms · 2026-05-19T18:38:53.248529+00:00 · methodology
