Automated Big Data Quality Assessment using Knowledge Graph Embeddings
Pith reviewed 2026-05-20 21:24 UTC · model grok-4.3
The pith
Knowledge graph embeddings predict missing links to generate context-specific data quality plans for big data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset.
What carries the argument
Knowledge graph embeddings that predict missing edges between dataset context nodes and quality rule or dimension nodes in a literature-derived graph, with numerical attributes supplying weights.
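As a rough illustration of the link-prediction step described above, the sketch below scores candidate edges with a ComplEx-style function, one common knowledge graph embedding model. The embeddings are hand-set toy values, and the names `complex_score`, `rank_candidates`, the `requires` relation, and the rule names are all hypothetical; the paper's actual model, graph, and AmpliGraph training setup are not reproduced here.

```python
from typing import Dict, List, Tuple

Embedding = List[complex]


def complex_score(subj: Embedding, rel: Embedding, obj: Embedding) -> float:
    """ComplEx-style triple score: Re(sum_k s_k * r_k * conj(o_k))."""
    return sum((s * r * o.conjugate()).real for s, r, o in zip(subj, rel, obj))


def rank_candidates(
    subj: Embedding, rel: Embedding, candidates: Dict[str, Embedding]
) -> List[Tuple[str, float]]:
    """Score every candidate object node and sort from most to least plausible."""
    scored = [(name, complex_score(subj, rel, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, hand-set embeddings (assumption: a trained model would learn these).
context = [1 + 0j, 0 + 1j]    # dataset context node
requires = [1 + 0j, 1 + 0j]   # hypothetical "requires" relation
rules = {
    "completeness_check": [1 + 0j, 0 + 1j],
    "range_check": [0 + 1j, 1 + 0j],
    "format_check": [-1 + 0j, 0 - 1j],
}

ranking = rank_candidates(context, requires, rules)
for name, score in ranking:
    print(f"{name}: {score:+.1f}")
```

Higher scores mark more plausible missing edges; in this toy setup the rule whose embedding aligns with the context node ranks first.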
If this is right
- The method overcomes limitations of strict matching by incorporating contextual characteristics from literature.
- Numerical edge attributes provide weights for each predicted quality measurement.
- A comprehensive and context-specific assessment plan is generated for each input dataset.
- Evaluation on a real-world radiation sensors dataset confirms the approach can produce such a plan.
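One way to picture how predicted edges and numerical edge attributes could combine into a weighted plan is sketched below. It assumes each predicted edge carries a model score and a raw weight, keeps edges above a score threshold, and normalizes the weights to sum to 1. The `PredictedEdge` type, the `build_plan` function, the threshold, and all values are hypothetical; the paper does not specify this exact procedure.

```python
from typing import Dict, List, NamedTuple


class PredictedEdge(NamedTuple):
    rule: str
    dimension: str
    score: float   # plausibility of the predicted link (hypothetical value)
    weight: float  # numerical edge attribute (hypothetical value)


def build_plan(edges: List[PredictedEdge], min_score: float = 0.5) -> List[Dict[str, object]]:
    """Keep sufficiently plausible edges and normalize their weights to sum to 1."""
    kept = [e for e in edges if e.score >= min_score]
    total = sum(e.weight for e in kept)
    if total == 0:
        return []
    plan = [
        {"rule": e.rule, "dimension": e.dimension, "weight": e.weight / total}
        for e in kept
    ]
    return sorted(plan, key=lambda item: item["weight"], reverse=True)


edges = [
    PredictedEdge("null_ratio", "completeness", score=0.9, weight=3.0),
    PredictedEdge("range_check", "accuracy", score=0.8, weight=1.0),
    PredictedEdge("dedupe", "uniqueness", score=0.2, weight=5.0),  # filtered out
]
plan = build_plan(edges)
for item in plan:
    print(item)
```

Normalizing after filtering keeps the weights comparable across datasets whose plans contain different numbers of rules.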
Where Pith is reading between the lines
- The same embedding technique could support real-time quality monitoring if the knowledge graph is updated dynamically.
- Expanding the literature sources in the graph might improve performance for specialized data types like financial or medical records.
- The weighted plans could feed directly into automated data cleaning or repair systems as priorities.
- Generalization tests on datasets outside the original sensor domain would clarify how domain-specific the current graph is.
Load-bearing premise
A knowledge graph assembled from literature review plus numerical edge attributes will allow embeddings to produce accurate, weighted quality measurements for arbitrary new input datasets.
What would settle it
Testing the generated quality assessment plans against independent expert reviews on several new datasets from different domains; low agreement on relevant rules or weights would show the predictions do not generalize reliably.
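A minimal sketch of how such agreement might be quantified, assuming both the generated plan and the expert review are expressed as rule-to-weight mappings: Jaccard overlap measures agreement on which rules are relevant, and the mean absolute gap measures agreement on weights for shared rules. The function names and sample values are illustrative assumptions, not part of the paper.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two rule sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_abs_weight_gap(pred: Dict[str, float], expert: Dict[str, float]) -> float:
    """Average absolute weight difference over rules present in both plans."""
    shared = pred.keys() & expert.keys()
    if not shared:
        return float("nan")
    return sum(abs(pred[r] - expert[r]) for r in shared) / len(shared)


# Hypothetical plans for one dataset.
predicted = {"null_ratio": 0.75, "range_check": 0.25}
expert = {"null_ratio": 0.5, "range_check": 0.3, "timestamp_order": 0.2}

rule_agreement = jaccard(set(predicted), set(expert))
weight_gap = mean_abs_weight_gap(predicted, expert)
print(f"rule agreement (Jaccard): {rule_agreement:.3f}")
print(f"mean absolute weight gap: {weight_gap:.3f}")
```

Low Jaccard values across several domains would indicate that the predicted rules do not generalize; a large weight gap would indicate the same for the weights.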
Original abstract
Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a knowledge graph embedding method for automated, context-aware big data quality assessment. It constructs a KG from literature-derived contextual data characteristics and quality rules/dimensions, uses embeddings (via AmpliGraph) to predict missing edges linking an input dataset's context to relevant quality operations, injects numerical edge attributes to produce weighted scores, and claims this yields a comprehensive assessment plan superior to strict matching. Evaluation is described on a real-world radiation sensors dataset from LAEC-CNRS, with the abstract asserting that results demonstrate the method's capability to generate such a plan.
Significance. If the embedding-based predictions can be shown to be accurate, independent of KG construction choices, and generalizable via a clear mapping procedure for arbitrary new datasets, the work could offer a flexible alternative to rigid rule-matching approaches in data quality assessment. The choice of a real dataset and the AmpliGraph framework is a constructive starting point, but the absence of any reported metrics leaves the practical significance unestablished.
Major comments (3)
- [Abstract] Abstract (evaluation paragraph): The manuscript asserts that results on the LAEC-CNRS radiation sensors dataset 'demonstrate the capability of our solution to generate a comprehensive data quality assessment plan,' yet reports no quantitative link-prediction metrics (e.g., MRR, Hits@K, AUC), no baselines, no error bars, and no ablation on embedding hyperparameters or edge-weight injection. This directly undermines verification of the central claim that the approach produces accurate, weighted quality measurements.
- [Approach] Approach description (KG construction and edge prediction): No formal definition, algorithm, or pseudocode is supplied for encoding an arbitrary new input dataset's context features as a node, subgraph, or attribute vector within the literature-derived KG to enable reliable link prediction. Without this mechanism, the asserted advantage over strict matching remains an untested modeling assumption rather than a demonstrated result.
- [Approach] KG construction and prediction step: The central edge-prediction step operates on a graph whose nodes and relations are assembled from prior literature plus injected numerical attributes; the manuscript does not clarify whether the final quality scores constitute independent predictions or largely restate the input construction choices, raising a risk that performance is circular.
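For reference, the link-prediction metrics named in the first major comment can be computed from the rank of each true edge among its candidates. The sketch below uses plain Python and made-up ranks; the function names `mrr` and `hits_at_k` are illustrative and are not taken from AmpliGraph or from the paper.

```python
from typing import List


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank; ranks are 1-based positions of the true edges."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of true edges ranked within the top k candidates."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of four held-out (context, rule) edges.
ranks = [1, 2, 4, 10]
print(f"MRR     = {mrr(ranks):.4f}")
print(f"Hits@3  = {hits_at_k(ranks, 3):.2f}")
print(f"Hits@10 = {hits_at_k(ranks, 10):.2f}")
```

Reporting these on held-out edges, alongside a strict-matching baseline, would give the quantitative evidence the report asks for.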
Minor comments (2)
- [Abstract] The abstract refers to 'surpassing conventional practices' and 'overcoming the limitations of traditional methods' without any comparative evaluation or citation of specific baselines; this weakens the positioning of the contribution.
- [Approach] Notation for 'context representation,' 'quality rules and dimensions,' and 'numerical edge attributes' is introduced at a high level but never formalized or illustrated with an example subgraph or embedding vector.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas where additional clarity and evidence are needed to strengthen the claims. We address each major comment point by point below, indicating the revisions we will incorporate.
- Referee: Abstract claims results on LAEC-CNRS dataset demonstrate capability to generate comprehensive plan, yet reports no quantitative link-prediction metrics (MRR, Hits@K, AUC), no baselines, no error bars, and no ablation on hyperparameters or edge-weight injection.
  Authors: We agree that the current manuscript does not report the quantitative metrics needed to substantiate the central claim. In the revised version we will add standard link-prediction metrics (MRR, Hits@K, AUC-ROC) computed on the radiation-sensor dataset using AmpliGraph. We will also include baseline comparisons against strict rule-matching and random link prediction, report error bars from repeated training runs, and present a brief ablation on embedding dimension and the numerical-attribute injection step. These additions will allow direct verification of prediction accuracy.
  Revision: yes
- Referee: No formal definition, algorithm, or pseudocode supplied for encoding an arbitrary new input dataset's context features as a node, subgraph, or attribute vector within the literature-derived KG to enable reliable link prediction.
  Authors: We acknowledge that the manuscript lacks an explicit, reusable specification of the context-encoding step. In the revision we will introduce a formal definition of the context representation (as a set of typed nodes and attribute vectors derived from dataset metadata), together with pseudocode that shows how these elements are inserted into the pre-built literature KG before link prediction is performed. This will make the mapping procedure for new datasets explicit and demonstrate the claimed generality beyond the single evaluation case.
  Revision: yes
- Referee: Central edge-prediction step operates on a graph assembled from prior literature plus injected numerical attributes; manuscript does not clarify whether final quality scores are independent predictions or largely restate input construction choices, raising risk of circular performance.
  Authors: We appreciate the concern about potential circularity. The base KG is constructed exclusively from literature-derived contextual characteristics and quality rules/dimensions; no information from the evaluation dataset enters this construction. Embeddings are learned on this fixed graph. For a new dataset only its context nodes and attributes are added, after which the trained model predicts missing edges to quality operations according to the learned latent patterns. Numerical attributes are applied only after prediction to produce weighted scores. We will add a clarifying subsection with a concrete example that contrasts the predicted edges against what a direct lookup of the construction choices would yield, thereby showing that the scores are not circular.
  Revision: yes
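The context-encoding step and the lookup contrast promised in these responses might look roughly like the sketch below. It assumes dataset metadata is a flat key-value mapping, encodes it as triples attached to a new dataset node, and compares a hypothetical set of predicted rules against what strict matching on the base KG would return. The relation names (`has_<key>`, `requires_rule`), function names, and all data are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def encode_context(dataset_id: str, metadata: Dict[str, str]) -> List[Triple]:
    """Turn dataset metadata into (dataset, has_<key>, value) context triples."""
    return [(dataset_id, f"has_{key}", value) for key, value in sorted(metadata.items())]


def lookup_rules(context: List[Triple], base_kg: List[Triple]) -> Set[str]:
    """Strict matching: rules linked directly to any context value in the base KG."""
    values = {obj for _, _, obj in context}
    return {o for s, p, o in base_kg if p == "requires_rule" and s in values}


def novel_predictions(predicted: Set[str], looked_up: Set[str]) -> Set[str]:
    """Rules suggested by the model that a direct lookup would not return."""
    return predicted - looked_up


# Hypothetical literature-derived KG fragment and dataset metadata.
base_kg = [
    ("sensor", "requires_rule", "range_check"),
    ("batch", "requires_rule", "dedupe"),
]
context = encode_context("ds1", {"source": "sensor", "velocity": "streaming"})
looked_up = lookup_rules(context, base_kg)

# Assumption: stand-in for the output of a trained embedding model.
predicted = {"range_check", "timeliness_check"}
novel = novel_predictions(predicted, looked_up)
print(context)
print("lookup:", looked_up, "novel:", novel)
```

A non-empty `novel` set that experts judge correct would support the claim that predictions go beyond restating the graph's construction choices.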
Circularity Check
No significant circularity; derivation remains independent of inputs
The paper constructs a knowledge graph from literature-derived contextual characteristics and quality rules/dimensions, then applies embeddings (via AmpliGraph) to predict missing edges linking a new dataset's context representation to those rules. No equations, definitions, or self-citations are shown that make the final weighted quality scores equivalent to the graph-construction choices by construction. The evaluation uses an external real-world radiation sensors dataset, and the central claim of context-aware prediction is not reduced to a fitted parameter or renamed input. This is a standard non-circular modeling pipeline.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: A knowledge graph assembled from a thorough literature investigation accurately encodes contextual data characteristics and the required quality assessment operations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.