pith. sign in

arxiv: 2603.02995 · v2 · submitted 2026-03-03 · 💻 cs.DB

A Graph-Native Approach to Normalization

Pith reviewed 2026-05-15 16:43 UTC · model grok-4.3

classification 💻 cs.DB
keywords graph normalizationfunctional dependencieslabeled property graphsknowledge graphsdata qualitynormal formsgraph databases
0
0 comments X

The pith

Graph-native normalization reduces redundancies in labeled property graphs by defining functional dependencies across nodes, edges, and their combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Labeled property graphs in knowledge graph applications often contain redundancies that lead to inconsistencies because they lack rigid schemas. Earlier work handled functional dependencies only inside nodes, yet real graphs also show dependencies that cross edges. The paper defines graph-native normal forms and graph object functional dependencies that treat nodes, edges, and their joint patterns together. It supplies algorithms that transform a graph into these normal forms while preserving its information. Experiments on synthetic and real graph datasets confirm that the transformations can be carried out in practice.

Core claim

The central claim is that normalization can be lifted from the relational setting to graphs by introducing graph object functional dependencies that capture constraints inside nodes, inside edges, and between nodes and edges, then using these to define a family of graph-native normal forms together with algorithms that rewrite a graph to satisfy the chosen form.

What carries the argument

Graph object functional dependencies (GOFDs) together with the associated graph-native normal forms, which identify and remove redundant data patterns involving nodes, edges, and node-edge combinations.

If this is right

  • Graphs can be systematically rewritten into higher normal forms that eliminate node-edge redundancies.
  • The same dependency notion supports algorithms that decide whether a given graph already satisfies a chosen normal form.
  • Knowledge-graph applications obtain data with fewer update and deletion anomalies after the transformation.
  • Normalization decisions can now be made on the combined structure of nodes and edges rather than nodes alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach opens a route to automatic normalization tools inside existing graph database systems.
  • Similar dependency-based rewriting might be applied to other graph models such as RDF or hypergraphs.
  • Normalized graphs could improve the accuracy of graph-query optimizers that rely on dependency information.
  • The framework supplies a measurable way to compare the quality of different graph datasets before and after cleaning.

Load-bearing premise

Functional dependencies that involve edges in real labeled property graphs can be discovered and used for normalization in the same direct way as relational functional dependencies without creating fresh inconsistencies.

What would settle it

A concrete counter-example would be a labeled property graph on which the proposed transformation algorithms either lose information that should be preserved or introduce new anomalies that were absent in the original data.

Figures

Figures reproduced from arXiv: 2603.02995 by Johannes Schrott, Katja Hose, Maxime Jakubowski.

Figure 1
Figure 1. Figure 1: Examples of denormalized LPGs describing courses, lecturers, and students. Redundancies are ............... underlined....... with ...... dots. :Course n1 title: “Database Systems” language: “English” .............. usingBook: . ......... “Alice” :Lecturer n3 :teaches e2 at: 2026-02-16 name: “Maxime” :Lecturer n2 :teaches e1 at: 2026-02-09 name: “Johannes” :Lecturer n4 :teaches e3 at: 2026-02-23 name: “Kat… view at source ↗
Figure 4
Figure 4. Figure 4: Resolving redundancies in Figure 1 related to [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: GO-FDs of different shapes for which no redun [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example graph for the calculation of per [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Percentual changes (log scale) to our structural per-graph metrics following graph-native [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of query performance (𝑚Q.DBHit) be￾tween original and rewritten queries the average number of properties but, due to reification, create substantially more new nodes and edges. Overall, Figure 7a shows that graph-native normalization almost always increases the number of graph objects and decreases the av￾erage number of properties per object. The exception is𝜓between-n-ep, which “moves” a prope… view at source ↗
read the original abstract

In recent years, knowledge graphs (KGs) - in particular in the form of labeled property graphs (LPGs) - have become essential components in a broad range of applications. Although the absence of strict schemas for KGs facilitates structural issues that lead to redundancies and subsequently to inconsistencies and anomalies, the problem of KG quality has so far received only little attention. Inspired by normalization using functional dependencies for relational data, a first approach exploiting dependencies within nodes has been proposed. However, real-world KGs also expose functional dependencies involving edges. In this paper, we therefore propose graph-native normalization, which considers dependencies within nodes, edges, and their combination. We define a range of graph-native normal forms and graph object functional dependencies and propose algorithms for transforming graphs accordingly. We evaluate our contributions using a broad range of synthetic and native graph datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes graph-native normalization for labeled property graphs (LPGs) in knowledge graphs. It extends relational normalization by introducing graph object functional dependencies (GOFDs) that capture dependencies within nodes, edges, and their combinations, defines corresponding graph-native normal forms, presents transformation algorithms, and evaluates the approach on synthetic and native graph datasets.

Significance. If the GOFD semantics and algorithms can be shown to eliminate redundancies and anomalies without introducing inconsistencies in LPGs, the work would provide a foundational extension of Codd-style normalization to graph data models. This addresses an important gap in KG quality management and could influence schema design and data cleaning practices in graph databases.

major comments (3)
  1. [Abstract] Abstract: The central claim that GOFDs enable normalization 'analogous to relational FDs' is load-bearing but unsupported; the manuscript supplies no concrete formal semantics for edge GOFDs (e.g., how a property on an edge is determined by source/target node properties or other edges) or proofs that enforcement preserves graph structure (multi-edges, directionality, connectivity).
  2. [Algorithms] Algorithms section: The transformation algorithms are described at a high level but lack any proof of correctness, termination, or anomaly elimination; without these, it is impossible to verify that normalization removes redundancies without creating new inconsistencies in the graph.
  3. [Evaluation] Evaluation section: The paper states that it evaluates the contributions on synthetic and native datasets, yet no metrics, baseline comparisons, or quantitative results (e.g., redundancy reduction, anomaly counts before/after) are provided, leaving the empirical support for the claims unverifiable.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including a small concrete example of a node-edge GOFD and the resulting normalized graph.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide stronger formal foundations and empirical support.

read point-by-point responses
  1. Referee: [Abstract] The central claim that GOFDs enable normalization 'analogous to relational FDs' is load-bearing but unsupported; the manuscript supplies no concrete formal semantics for edge GOFDs (e.g., how a property on an edge is determined by source/target node properties or other edges) or proofs that enforcement preserves graph structure (multi-edges, directionality, connectivity).

    Authors: We agree the abstract is high-level. Section 3 of the manuscript formally defines GOFDs, including edge GOFDs where an edge property is determined by a combination of source-node properties, target-node properties, and other edge properties. To address the concern, we will expand the abstract and add an explicit subsection with the full semantics plus a proof sketch showing that normalization steps preserve multi-edges, directionality, and connectivity. revision: yes

  2. Referee: [Algorithms] The transformation algorithms are described at a high level but lack any proof of correctness, termination, or anomaly elimination; without these, it is impossible to verify that normalization removes redundancies without creating new inconsistencies in the graph.

    Authors: We acknowledge the algorithms section presents pseudocode and high-level steps without formal proofs. We will add a new subsection containing proofs of correctness, termination (leveraging the finite size of the graph and strict reduction in dependency violations), and anomaly elimination that does not introduce inconsistencies while preserving graph invariants. revision: yes

  3. Referee: [Evaluation] The paper states that it evaluates the contributions on synthetic and native datasets, yet no metrics, baseline comparisons, or quantitative results (e.g., redundancy reduction, anomaly counts before/after) are provided, leaving the empirical support for the claims unverifiable.

    Authors: The current evaluation provides qualitative discussion of the datasets. We agree quantitative evidence is needed and will revise the section to include concrete metrics (redundancy reduction ratios, pre/post anomaly counts) together with baseline comparisons on both the synthetic and native graph datasets. revision: yes

Circularity Check

0 steps flagged

No circularity: graph-native normal forms extend relational theory independently

full rationale

The paper defines graph object functional dependencies (GOFDs) and graph-native normal forms by direct analogy to relational functional dependencies, explicitly covering nodes, edges, and their combinations as new constructs. No equations or algorithms reduce the proposed normal forms back to the input data or prior definitions by construction; the transformation algorithms are presented as novel procedures rather than tautological renamings or self-fitted parameters. The derivation chain remains self-contained against external relational benchmarks without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on the domain assumption that graph objects admit functional dependencies analogous to relational tuples; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Functional dependencies can be defined for nodes, edges, and their combinations in labeled property graphs
    Direct extension of relational functional dependencies stated in the abstract as the foundation for graph-native normal forms.

pith-pipeline@v0.9.0 · 5434 in / 1024 out tokens · 27637 ms · 2026-05-15T16:43:55.218084+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

  1. [1]

    1995.Foundations of Databases

    Serge Abiteboul, Richard Hull, and Victor Vianu. 1995.Foundations of Databases. Addison-Wesley

  2. [2]

    Shqiponja Ahmetaj, Iovka Boneva, Jan Hidders, Katja Hose, Maxime Jakubowski, José Emilio Labra Gayo, Wim Martens, Fabio Mogavero, Filip Murlak, Cem Okul- mus, Axel Polleres, Ognjen Savkovic, Mantas Simkus, and Dominik Tomaszuk

  3. [3]

    Common Foundations for SHACL, ShEx, and PG-Schema. InWWW. ACM, 8–21

  4. [4]

    Renzo Angles, Angela Bonifati, Stefania Dumbrava, George Fletcher, Alastair Green, Jan Hidders, Bei Li, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Stefan Plantikow, Ognjen Savkovic, Michael Schmidt, Juan Sequeda, Slawek Staworko, Dominik Tomaszuk, Hannes Voigt, Domagoj Vrgoc, Mingxi Wu, and Dusan Zivkovic. 2023. PG-Schema: Schemas for Prop...

  5. [5]

    Hare, Jan Hidders, Victor E

    Renzo Angles, Angela Bonifati, Stefania Dumbrava, George Fletcher, Keith W. Hare, Jan Hidders, Victor E. Lee, Bei Li, Leonid Libkin, Wim Martens, Filip Murlak, Josh Perryman, Ognjen Savkovic, Michael Schmidt, Juan F. Sequeda, Slawek Staworko, and Dominik Tomaszuk. 2021. PG-Keys: Keys for Property Graphs. InSIGMOD Conference. ACM, 2423–2436

  6. [6]

    William Ward Armstrong. 1974. Dependency Structures of Data Base Relation- ships. InIFIP Congress. North-Holland, 580–583

  7. [7]

    2016.Data and Information Quality - Dimensions, Principles and Techniques

    Carlo Batini and Monica Scannapieco. 2016.Data and Information Quality - Dimensions, Principles and Techniques. Springer

  8. [8]

    Angela Bonifati. 2025. Versatile Property Graph Transformations.Proc. VLDB Endow.18, 12 (2025), 5516–5526

  9. [9]

    Angela Bonifati, George H. L. Fletcher, Hannes Voigt, and Nikolay Yakovets. 2018.Querying Graphs. Morgan & Claypool Publishers

  10. [10]

    Antoon Bronselaer. 2021. Data Quality Management: An Overview of Methods and Challenges. InFQAS (Lecture Notes in Computer Science, Vol. 12871). Springer, 127–141

  11. [11]

    E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM13, 6 (1970), 377–387

  12. [12]

    E. F. Codd. 1971. Further Normalization of the Data Base Relational Model. Research Report / RJ / IBM / San Jose, CaliforniaRJ909 (1971)

  13. [13]

    E. F. Codd. 1974. Recent Investigations in Relational Data Base Systems. InIFIP Congress. North-Holland, 1017–1021

  14. [14]

    Claude Delobel and Richard G. Casey. 1973. Decomposition of a Data Base and the Theory of Boolean Switching Functions.IBM J. Res. Dev.17, 5 (1973), 374–386

  15. [15]

    Lisa Ehrlinger and Wolfram Wöß. 2018. A Novel Data Quality Metric for Mini- mality. InQUAT@WISE (Lecture Notes in Computer Science, Vol. 11235). Springer, 1–15

  16. [16]

    Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, and Domagoj Vrgoc. 2023. GPC: A Pattern Calculus for Property Graphs. InPODS. ACM, 241–250

  17. [17]

    Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, and Domagoj Vrgoc. 2023. A Researcher’s Digest of GQL (Invited Talk). InICDT (LIPIcs, Vol. 255). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 1:1–1:22

  18. [18]

    Jelle Hellings, Marc Gyssens, Jan Paredaens, and Yuqing Wu. 2014. Implication and Axiomatization of Functional Constraints on Patterns with an Application to the RDF Data Model. InFoIKS (Lecture Notes in Computer Science, Vol. 8367). Springer, 250–269

  19. [19]

    2018.Datenbanken - Konzepte und Sprachen, 6

    Andreas Heuer, Gunter Saake, and Kai-Uwe Sattler. 2018.Datenbanken - Konzepte und Sprachen, 6. Auflage. MITP

  20. [20]

    Rashid, Anisa Rula, Lukas Schmelzeisen, Juan F

    Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Nav- igli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan F. Sequeda, Steffen Staab, and Antoine Zimmermann. 2022. Knowledge Graphs.ACM Comput...

  21. [21]

    ISO/IEC 39075:2024 - Information technology — Database languages — GQL

    ISO GQL. ISO/IEC 39075:2024 - Information technology — Database languages — GQL. https://www.iso.org/standard/76120.html

  22. [22]

    Ernests Lavrinovics, Russa Biswas, Johannes Bjerva, and Katja Hose. 2025. Knowl- edge Graphs, Large Language Models, and Hallucinations: An NLP Perspective. J. Web Semant.85 (2025), 100844

  23. [23]

    2007.Informationsintegration - Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen

    Ulf Leser and Felix Naumann. 2007.Informationsintegration - Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen. dpunkt.verlag

  24. [24]

    Junhong Lin, Xiaojie Guo, Yada Zhu, Samuel Mitchell, Erik Altman, and Julian Shun. 2024. FraudGT: A Simple, Effective, and Efficient Graph Transformer for Financial Fraud Detection. InICAIF. ACM, 292–300

  25. [25]

    Maude Manouvrier and Khalid Belhajjame. 2026. Graph functional dependencies: Analysis and translation to PG-schema.Inf. Syst.136 (2026), 102633

  26. [26]

    Graph data model

    Memgraph. Graph data model. https://memgraph.com/docs/data-modeling/ graph-data-model

  27. [27]

    Graph database concepts

    Neo4j. Graph database concepts. https://neo4j.com/docs/getting-started/ appendix/graphdb-concepts/

  28. [28]

    2026.Northwind Graph Example

    Neo4j. 2026.Northwind Graph Example. https://github.com/neo4j-graph- examples/northwind

  29. [29]

    Offshore Leaks Database

    International Consortium of Investigative Journalists. Offshore Leaks Database. https://github.com/ICIJ/offshoreleaks-data-packages

  30. [30]

    1989.The Structure of the Relational Database Model

    Jan Paredaens, Paul De Bra, Marc Gyssens, and Dirk Van Gucht. 1989.The Structure of the Relational Database Model. EATCS Monographs on Theoretical Computer Science, Vol. 17. Springer

  31. [31]

    Terence Parr. ANTLR. https://www.antlr.org

  32. [32]

    Kashif Rabbani, Matteo Lissandrini, Angela Bonifati, and Katja Hose. 2024. Trans- forming RDF Graphs to Property Graphs using Standardized Schemas.Proc. ACM Manag. Data2, 6 (2024), 242:1–242:25

  33. [33]

    Aref, Marcelo Arenas, Maciej Besta, Peter A

    Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid G. Aref, Marcelo Arenas, Maciej Besta, Peter A. Boncz, Johannes Schrott, Maxime Jakubowski, and Katja Hose Khuzaima Daudjee, Emanuele Della Valle, Stefania Dumbrava, Olaf Hartig, Bern- hard Haslhofer, Tim Hegeman, Jan Hidders, Katja Hose, Adriana Iamnitchi, Vasi...

  34. [34]

    2026.London Public Transport

    Johannes Schrott. 2026.London Public Transport. doi:10.5281/zenodo.18479731

  35. [35]

    2026.Train Services

    Johannes Schrott. 2026.Train Services. doi:10.5281/zenodo.18478421

  36. [36]

    Philipp Skavantzos and Sebastian Link. 2023. Normalizing Property Graphs. Proc. VLDB Endow.16, 11 (2023), 3031–3043

  37. [37]

    Philipp Skavantzos and Sebastian Link. 2025. Third and Boyce-Codd normal form for property graphs.VLDB J.34, 2 (2025), 23

  38. [38]

    Philipp Skavantzos, Kaiqi Zhao, and Sebastian Link. 2021. Uniqueness Constraints on Property Graphs. InCAiSE (Lecture Notes in Computer Science, Vol. 12751). Springer, 280–295

  39. [39]

    2000.Entity-relationship modeling - foundations of database technology

    Bernhard Thalheim. 2000.Entity-relationship modeling - foundations of database technology. Springer

  40. [40]

    Bernhard Thalheim. 2020. Schema Optimisation Instead of (Local) Normalisation. InFoIKS (Lecture Notes in Computer Science, Vol. 12012). Springer, 281–300

  41. [41]

    Piccadilly Circus

    Domagoj Vrgoc, Carlos Rojas, Renzo Angles, Marcelo Arenas, Diego Arroyuelo, Carlos Buil-Aranda, Aidan Hogan, Gonzalo Navarro, Cristian Riveros, and Juan Romero. 2023. MillenniumDB: An Open-Source Graph Database System.Data Intell.5, 3 (2023), 560–610. A Graph-Native Approach to Normalization A Appendix As supplementary material to the experimental evaluat...