Automated Big Data Quality Assessment using Knowledge Graph Embeddings

Ali Jaber; Hadi Fadlallah; Mitri Haber; Rima Kilany

arxiv: 2605.18833 · v1 · pith:XXAG2YL5new · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 21:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords data quality assessment · knowledge graph embeddings · big data · context-aware · quality rules · edge prediction · automated assessment

The pith

Knowledge graph embeddings predict missing links to generate context-specific data quality plans for big data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that knowledge graph embeddings can predict which quality rules and dimensions apply to a new dataset by linking its context representation to a graph built from literature on data characteristics and assessment operations. This matters because traditional automated methods rely on strict matching that ignores context, which often yields incomplete or inaccurate quality evaluations for big data. By adding numerical attributes to edges for weighting and using embeddings to predict links that strict matching would miss, the approach produces a tailored assessment plan with prioritized measurements. A sympathetic reader would care because reliable quality assessment underpins trustworthy insights from large datasets in research and industry.

Core claim

Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset.

What carries the argument

Knowledge graph embeddings that predict missing edges between dataset context nodes and quality rule or dimension nodes in a literature-derived graph, with numerical attributes supplying weights.
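
To make the edge-prediction idea concrete, here is a minimal sketch of ComplEx-style scoring, the family of model described in reference [15]. It is not the paper's implementation: the node names, the relation name, and the hand-set embedding values are all hypothetical, and a real model would learn the embeddings from the knowledge graph rather than hard-code them.

```python
# Toy ComplEx-style link scoring in pure Python.
# ASSUMPTION: all names and embedding values below are invented for
# illustration; they do not come from the paper.

def complex_score(subj, rel, obj):
    """ComplEx score: real part of sum_i subj_i * rel_i * conj(obj_i)."""
    return sum((s * r * o.conjugate()).real for s, r, o in zip(subj, rel, obj))


def rank_candidates(context_vec, relation_vec, candidates):
    """Score every candidate quality-rule node and sort from best to worst."""
    scored = [
        (name, complex_score(context_vec, relation_vec, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-dimensional complex embeddings.
context = [1 + 0j, 0 + 1j]          # embedding of the new dataset's context node
requires_rule = [1 + 0j, 1 + 0j]    # embedding of a "requires rule" relation
quality_rules = {
    "range_check": [1 + 0j, 0 + 1j],
    "schema_validation": [0 + 1j, 1 + 0j],
    "duplicate_detection": [-1 + 0j, 0 - 1j],
}

ranking = rank_candidates(context, requires_rule, quality_rules)
for name, score in ranking:
    print(f"{name}: {score:+.1f}")
```

Candidates whose embeddings align with the context under the relation score highest, so the top of the ranking is what a link predictor would propose as missing edges.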

If this is right

The method overcomes limitations of strict matching by incorporating contextual characteristics from literature.
Numerical edge attributes provide weights for each predicted quality measurement.
A comprehensive and context-specific assessment plan is generated for each input dataset.
Evaluation on a real-world radiation sensors dataset confirms the approach can produce such a plan.
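
As a rough illustration of how predicted edges and numerical weights could combine into a prioritized plan, the sketch below squashes each raw score into a probability, drops unlikely rules, and multiplies the survivors by a per-rule weight. The combination rule, the threshold, and all names and numbers are assumptions made for this example; the paper does not spell out its weighting formula here.

```python
import math

# ASSUMPTION: post-hoc weighting of predicted edges. The rule names, scores,
# weights, and the 0.5 threshold are hypothetical.

def build_weighted_plan(raw_scores, edge_weights, threshold=0.5):
    """Turn raw link scores into a normalised, prioritised list of rules."""
    kept = []
    for rule, score in raw_scores.items():
        probability = 1.0 / (1.0 + math.exp(-score))  # logistic squashing
        if probability < threshold:
            continue  # edge judged unlikely, leave it out of the plan
        # Rules without a known numeric attribute default to weight 1.0.
        kept.append((rule, probability * edge_weights.get(rule, 1.0)))

    total = sum(weight for _, weight in kept)
    if total == 0:
        return []
    normalised = [(rule, weight / total) for rule, weight in kept]
    return sorted(normalised, key=lambda pair: pair[1], reverse=True)


raw_scores = {"range_check": 2.0, "schema_validation": 0.0, "duplicate_detection": -2.0}
edge_weights = {"range_check": 0.9, "schema_validation": 0.6}

plan = build_weighted_plan(raw_scores, edge_weights)
for rule, priority in plan:
    print(f"{rule}: {priority:.2f}")
```

Normalising the priorities to sum to one makes plans comparable across datasets, though other schemes (for example, keeping raw weights) would be equally defensible.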

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same embedding technique could support real-time quality monitoring if the knowledge graph is updated dynamically.
Expanding the literature sources in the graph might improve performance for specialized data types like financial or medical records.
The weighted plans could feed directly into automated data cleaning or repair systems as priorities.
Generalization tests on datasets outside the original sensor domain would clarify how domain-specific the current graph is.

Load-bearing premise

A knowledge graph assembled from literature review plus numerical edge attributes will allow embeddings to produce accurate, weighted quality measurements for arbitrary new input datasets.

What would settle it

Testing the generated quality assessment plans against independent expert reviews on several new datasets from different domains; low agreement on relevant rules or weights would show the predictions do not generalize reliably.
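
One hypothetical way to quantify the agreement described above is to compare the set of rules in a generated plan with the set an expert selects, using precision, recall, and Jaccard overlap. The rule names are invented for illustration, and the choice of these three measures is an assumption rather than something the paper prescribes.

```python
# ASSUMPTION: agreement is measured on sets of rule names; weights are ignored.

def plan_agreement(predicted_rules, expert_rules):
    """Return precision, recall, and Jaccard overlap between two rule sets."""
    predicted, expert = set(predicted_rules), set(expert_rules)
    shared = predicted & expert
    union = predicted | expert
    return {
        "precision": len(shared) / len(predicted) if predicted else 0.0,
        "recall": len(shared) / len(expert) if expert else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical plan output and expert judgement for one dataset.
generated = ["range_check", "completeness", "timeliness"]
expert = ["range_check", "timeliness", "consistency"]

agreement = plan_agreement(generated, expert)
print(agreement)
```

Consistently low values across several domains would point toward the generalization failure the paragraph above describes.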

Figures

Figures reproduced from arXiv: 2605.18833 by Ali Jaber, Hadi Fadlallah, Mitri Haber, Rima Kilany.

**Figure 2.** A knowledge graph embedding model architecture enhanced with FocusE [14]

**Figure 3.** Data context characteristics

**Figure 4.** BIGQA context analyzer workflow [5]

**Figure 5.** A representation of a data context with the related data quality assessment

**Figure 6.** Predicting possible data quality assessment plan of a new data context

**Figure 7.** Generating data quality assessment plan using a knowledge graph embedding model

**Figure 8.** Retrieved data quality assessment plan using BIGQA context analyzer

**Figure 9.** Retrieved data quality assessment plan using AmpliGraph

Original abstract

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper applies literature-derived knowledge graphs and embeddings to generate context-aware data quality plans, tested on one real sensor dataset, but supplies no metrics or clear input-mapping steps.

The main takeaway is that this work builds a knowledge graph from data quality literature, injects numerical edge weights, and runs embeddings to predict which rules and dimensions fit a new dataset's context. The goal is a tailored assessment plan that goes beyond rigid matching by using embedding similarities. They evaluate on a radiation sensors dataset from the Lebanese Atomic Energy Commission and use the AmpliGraph library, which at least ties the idea to an established tool and real data rather than synthetic examples. That choice earns some credit for moving toward practicality in big-data pipelines. The numerical weighting on edges is a simple addition that could help rank outputs.

The soft spots are the missing technical links. There is no explicit algorithm or formal definition for turning an arbitrary input dataset's context features into nodes, attributes, or a subgraph that the embeddings can then predict over. Without that, the advantage claimed over strict matching stays untested. The results section asserts that the method produces a comprehensive plan and shows capability, yet reports no quantitative scores, baselines, error rates, MRR, Hits@K, or ablation details on the embeddings or hyperparameters. This makes it impossible to judge whether the outputs are reliable or better than simpler approaches. The circularity concern also lands: since the graph nodes and relations come from prior literature, the predictions risk largely restating the construction choices rather than discovering independent quality signals.

This paper would interest data engineers and applied researchers working on automated quality checks for sensor or scientific datasets. A reader focused on practical KG uses in industry might extract the high-level architecture as a starting point. The thinking is straightforward and engages honestly with existing data-quality work, so it counts as serious.

I would send it to peer review, expecting referees to request the missing mapping details, standard metrics, and comparisons before the claims can be assessed properly.

Referee Report

3 major / 2 minor

Summary. The paper proposes a knowledge graph embedding method for automated, context-aware big data quality assessment. It constructs a KG from literature-derived contextual data characteristics and quality rules/dimensions, uses embeddings (via AmpliGraph) to predict missing edges linking an input dataset's context to relevant quality operations, injects numerical edge attributes to produce weighted scores, and claims this yields a comprehensive assessment plan superior to strict matching. Evaluation is described on a real-world radiation sensors dataset from LAEC-CNRS, with the abstract asserting that results demonstrate the method's capability to generate such a plan.

Significance. If the embedding-based predictions can be shown to be accurate, independent of KG construction choices, and generalizable via a clear mapping procedure for arbitrary new datasets, the work could offer a flexible alternative to rigid rule-matching approaches in data quality assessment. The choice of a real dataset and AmpliGraph framework is a constructive starting point, but the absence of any reported metrics leaves the practical significance unestablished.

major comments (3)

[Abstract] Abstract (evaluation paragraph): The manuscript asserts that results on the LAEC-CNRS radiation sensors dataset 'demonstrate the capability of our solution to generate a comprehensive data quality assessment plan,' yet reports no quantitative link-prediction metrics (e.g., MRR, Hits@K, AUC), no baselines, no error bars, and no ablation on embedding hyperparameters or edge-weight injection. This directly undermines verification of the central claim that the approach produces accurate, weighted quality measurements.
[Approach] Approach description (KG construction and edge prediction): No formal definition, algorithm, or pseudocode is supplied for encoding an arbitrary new input dataset's context features as a node, subgraph, or attribute vector within the literature-derived KG to enable reliable link prediction. Without this mechanism, the asserted advantage over strict matching remains an untested modeling assumption rather than a demonstrated result.
[Approach] KG construction and prediction step: The central edge-prediction step operates on a graph whose nodes and relations are assembled from prior literature plus injected numerical attributes; the manuscript does not clarify whether the final quality scores constitute independent predictions or largely restate the input construction choices, raising a risk that performance is circular.
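
For readers unfamiliar with the link-prediction metrics named in the first major comment, the following sketch shows how MRR and Hits@K are conventionally computed from the rank of each held-out true edge among its candidates. The ranks are made-up numbers for illustration and are not results from the paper.

```python
# Standard link-prediction metrics computed from 1-based ranks.
# ASSUMPTION: the ranks below are invented; the paper reports no such values.

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test edges (higher is better, max 1.0)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test edges whose true target is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Rank of the true quality rule for four hypothetical held-out edges.
ranks = [1, 2, 4, 10]

print(f"MRR     = {mean_reciprocal_rank(ranks):.4f}")
print(f"Hits@3  = {hits_at_k(ranks, 3):.2f}")
print(f"Hits@10 = {hits_at_k(ranks, 10):.2f}")
```

Reporting these alongside a strict-matching baseline would give the comparison the referee asks for.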

minor comments (2)

[Abstract] The abstract refers to 'surpassing conventional practices' and 'overcoming the limitations of traditional methods' without any comparative evaluation or citation of specific baselines; this weakens the positioning of the contribution.
[Approach] Notation for 'context representation,' 'quality rules and dimensions,' and 'numerical edge attributes' is introduced at a high level but never formalized or illustrated with an example subgraph or embedding vector.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas where additional clarity and evidence are needed to strengthen the claims. We address each major comment point by point below, indicating the revisions we will incorporate.

Referee: Abstract claims results on LAEC-CNRS dataset demonstrate capability to generate comprehensive plan, yet reports no quantitative link-prediction metrics (MRR, Hits@K, AUC), no baselines, no error bars, and no ablation on hyperparameters or edge-weight injection.

Authors: We agree that the current manuscript does not report the quantitative metrics needed to substantiate the central claim. In the revised version we will add standard link-prediction metrics (MRR, Hits@K, AUC-ROC) computed on the radiation-sensor dataset using AmpliGraph. We will also include baseline comparisons against strict rule-matching and random link prediction, report error bars from repeated training runs, and present a brief ablation on embedding dimension and the numerical-attribute injection step. These additions will allow direct verification of prediction accuracy.

revision: yes

Referee: No formal definition, algorithm, or pseudocode supplied for encoding an arbitrary new input dataset's context features as a node, subgraph, or attribute vector within the literature-derived KG to enable reliable link prediction.

Authors: We acknowledge that the manuscript lacks an explicit, reusable specification of the context-encoding step. In the revision we will introduce a formal definition of the context representation (as a set of typed nodes and attribute vectors derived from dataset metadata), together with pseudocode that shows how these elements are inserted into the pre-built literature KG before link prediction is performed. This will make the mapping procedure for new datasets explicit and demonstrate the claimed generality beyond the single evaluation case.

revision: yes

Referee: Central edge-prediction step operates on a graph assembled from prior literature plus injected numerical attributes; manuscript does not clarify whether final quality scores are independent predictions or largely restate input construction choices, raising risk of circular performance.

Authors: We appreciate the concern about potential circularity. The base KG is constructed exclusively from literature-derived contextual characteristics and quality rules/dimensions; no information from the evaluation dataset enters this construction. Embeddings are learned on this fixed graph. For a new dataset only its context nodes and attributes are added, after which the trained model predicts missing edges to quality operations according to the learned latent patterns. Numerical attributes are applied only after prediction to produce weighted scores. We will add a clarifying subsection with a concrete example that contrasts the predicted edges against what a direct lookup of the construction choices would yield, thereby showing that the scores are not circular.

revision: yes
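
The context-encoding step discussed in the second response could look roughly like the sketch below, which turns dataset metadata into subject-predicate-object triples ready to be added to a knowledge graph. This is only a guess at the shape such a procedure might take: the function name, the `has_` predicate prefix, the `ctx:` identifier, and the metadata keys (loosely modelled on the characteristics listed in Figure 3) are all hypothetical.

```python
# ASSUMPTION: a dataset context is a flat dict of characteristic -> value(s).
# Every identifier and predicate name here is invented for illustration.

def context_to_triples(context_id, metadata):
    """Encode dataset metadata as (subject, predicate, object) triples."""
    triples = []
    for characteristic in sorted(metadata):  # sorted for reproducible output
        value = metadata[characteristic]
        # Allow a characteristic to carry either one value or several.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            triples.append((context_id, f"has_{characteristic}", str(item)))
    return triples


metadata = {
    "file_format": "JSON",
    "content_type": "radiation readings",
    "adopted_standards": ["ISO/IEC 25012"],
}

for triple in context_to_triples("ctx:sensor_demo", metadata):
    print(triple)
```

Once triples like these are in the graph, the new context node shares neighbours with literature-derived contexts, which is what gives a link predictor something to generalise from.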

Circularity Check

0 steps flagged

No significant circularity; derivation remains independent of inputs

The paper constructs a knowledge graph from literature-derived contextual characteristics and quality rules/dimensions, then applies embeddings (via AmpliGraph) to predict missing edges linking a new dataset's context representation to those rules. No equations, definitions, or self-citations are shown that make the final weighted quality scores equivalent to the graph-construction choices by construction. The evaluation uses an external real-world radiation sensors dataset, and the central claim of context-aware prediction is not reduced to a fitted parameter or renamed input. This is a standard non-circular modeling pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the assumption that literature can be compiled into a sufficiently complete and accurate knowledge graph whose missing edges, once predicted by embeddings, directly yield useful quality weights. No explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption A knowledge graph assembled from a thorough literature investigation accurately encodes contextual data characteristics and the required quality assessment operations.
This premise is required for the embedding step to produce a context-specific assessment plan.

pith-pipeline@v0.9.0 · 5760 in / 1164 out tokens · 45271 ms · 2026-05-20T21:24:59.851337+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

[1]

Data quality in context.Communications of the ACM, 40(5):103–110, 1997

Diane M Strong, Yang W Lee, and Richard Y Wang. Data quality in context.Communications of the ACM, 40(5):103–110, 1997

work page 1997
[2]

Beyond accuracy: What data quality means to data consumers.J

Richard Y Wang and Diane Strong. Beyond accuracy: What data quality means to data consumers.J. Manag. Inf. Syst., 12:5–33, 1996

work page 1996
[4]

Context-aware big data quality assessment: A scoping review.J

Hadi Fadlallah, Rima Kilany, Houssein Dhayne, Rami el Haddad, Rafiqul Haque, Yehia Taher, and Ali Jaber. Context-aware big data quality assessment: A scoping review.J. Data and Information Quality, jun 2023. Just Accepted

work page 2023
[5]

Bigqa: Declarative big data quality assessment.J

Hadi Fadlallah, Rima Kilany, Houssein Dhayne, Rami el Haddad, Rafiqul Haque, Yehia Taher, and Ali Jaber. Bigqa: Declarative big data quality assessment.J. Data and Information Quality, jun 2023. Just Accepted

work page 2023
[6]

Iso/iec 15939:2017 systems and software engineering — measurement process

ISO/IEC. Iso/iec 15939:2017 systems and software engineering — measurement process. Standard, ISO/IEC, 2017

work page 2017
[7]

Iso/iec 25000:2014

ISO/IEC. Iso/iec 25000:2014. systems and software engineering – system and software quality requirements and evaluation (square) – guide to square. Standard, ISO/IEC, 2014

work page 2014
[8]

Iso/iec 20547-3:2020 big data reference architecture - part 3:reference architecture

ISO/IEC. Iso/iec 20547-3:2020 big data reference architecture - part 3:reference architecture. Standard, ISO/IEC, 2020

work page 2020
[9]

node2vec: Scalable feature learning for networks

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. InProceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016

work page 2016
[10]

K-nearest neighbor.Scholarpedia, 4(2):1883, 2009

Leif E Peterson. K-nearest neighbor.Scholarpedia, 4(2):1883, 2009

work page 2009
[11]

Accenture/ampligraph: Ampligraph 2.0.0, March 2023

Luca Costabello, Alberto Bernardi, Adrianna Janik, and Sumit Pai. Accenture/ampligraph: Ampligraph 2.0.0, March 2023

work page 2023
[12]

Survey on graph embeddings and their applications to machine learning problems on graphs.PeerJ Computer Science, 7:e357, 2021

Ilya Makarov, Dmitrii Kiselev, Nikita Nikitinsky, and Lovro Subelj. Survey on graph embeddings and their applications to machine learning problems on graphs.PeerJ Computer Science, 7:e357, 2021

work page 2021
[13]

A survey on knowledge graph embedding: Approaches, applications and benchmarks.Electronics, 9(5):750, 2020

Yuanfei Dai, Shiping Wang, Neal N Xiong, and Wenzhong Guo. A survey on knowledge graph embedding: Approaches, applications and benchmarks.Electronics, 9(5):750, 2020

work page 2020
[14]

Learning embeddings from knowledge graphs with numeric edge attributes.arXiv preprint arXiv:2105.08683, 2021

Sumit Pai and Luca Costabello. Learning embeddings from knowledge graphs with numeric edge attributes.arXiv preprint arXiv:2105.08683, 2021

work page arXiv 2021
[15]

Complex embeddings for simple link prediction

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. InInternational conference on machine learning, pages 2071–2080. PMLR, 2016

work page 2071
[16]

Value-driven data quality assessment

Adir Even and Ganesan Shankaranarayanan. Value-driven data quality assessment. InICIQ, 2005

work page 2005
[17]

Utility-driven assessment of data quality.ACM SIGMIS Database: the DATABASE for Advances in Information Systems, 38(2):75–93, 2007

Adir Even and Ganesan Shankaranarayanan. Utility-driven assessment of data quality.ACM SIGMIS Database: the DATABASE for Advances in Information Systems, 38(2):75–93, 2007

work page 2007
[18]

Big data pre-processing: A quality framework

Ikbal Taleb, Rachida Dssouli, and Mohamed Adel Serhani. Big data pre-processing: A quality framework. In 2015 IEEE international congress on big data, pages 191–198. IEEE, 2015

work page 2015
[19]

Big data quality assessment model for unstructured data

Ikbal Taleb, Mohamed Adel Serhani, and Rachida Dssouli. Big data quality assessment model for unstructured data. In2018 International Conference on Innovations in Information Technology (IIT), pages 69–74. IEEE, 2018

work page 2018
[20]

The challenges of data quality and data quality assessment in the big data era.Data science journal, 14, 2015

Li Cai and Yangyong Zhu. The challenges of data quality and data quality assessment in the big data era.Data science journal, 14, 2015. 15 APREPRINT- FEBRUARY, 26 2024

work page 2015
[21]

A data quality in use model for big data.Future Generation Computer Systems, 63:123–130, 2016

Jorge Merino, Ismael Caballero, Bibiano Rivas, Manuel Serrano, and Mario Piattini. A data quality in use model for big data.Future Generation Computer Systems, 63:123–130, 2016

work page 2016
[22]

Evaluating the quality of social media data in big data architecture.Ieee Access, 3:2028–2043, 2015

Anne Immonen, Pekka Pääkkönen, and Eila Ovaska. Evaluating the quality of social media data in big data architecture.Ieee Access, 3:2028–2043, 2015

work page 2028
[23]

Reference architecture and classification of technologies, products and services for big data systems.Big data research, 2(4):166–186, 2015

Pekka Pääkkönen and Daniel Pakkala. Reference architecture and classification of technologies, products and services for big data systems.Big data research, 2(4):166–186, 2015

work page 2015
[24]

A data quality methodology for heterogeneous data.International Journal of Database Management Systems, 3(1):60–79, 2011

Carlo Batini, Federico Cabitza, Daniele Barone, Federico Cabitza, and Simone Grega. A data quality methodology for heterogeneous data.International Journal of Database Management Systems, 3(1):60–79, 2011

work page 2011
[25]

A context aware information quality framework

Markus Helfert and Owen Foley. A context aware information quality framework. In2009 F ourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology, pages 187–193. IEEE, 2009

work page 2009
[26]

Operational measurement of data quality

Antoon Bronselaer, Joachim Nielandt, Toon Boeckling, and Guy De Tré. Operational measurement of data quality. InInformation Processing and Management of Uncertainty in Knowledge-Based Systems. Applications: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11-15, 2018, Proceedings, Part III 17, pages 517–528. Springer, 2018

work page 2018
[27]

Ontology-based data cleaning

Zoubida Kedad and Elisabeth Métais. Ontology-based data cleaning. InNatural Language Processing and Information Systems: 6th International Conference on Applications of Natural Language to Information Systems, NLDB 2002 Stockholm, Sweden, June 27–28, 2002 Revised Papers 6, pages 137–149. Springer, 2002

work page 2002
[28]

An ontology-based approach to data cleaning

Xin Wang, Howard John Hamilton, and Yashu Bither. An ontology-based approach to data cleaning. Department of Computer Science, University of Regina Regina, SK, Canada, 2005

work page 2005
[29]

Anchoring the consistency dimension of data quality using ontology in data integration.2009 Sixth Web Information Systems and Applications Conference, pages 201–205, 2009

Chen Wei-Liang, Zhang Shi-Dong, and Gao Xiang. Anchoring the consistency dimension of data quality using ontology in data integration.2009 Sixth Web Information Systems and Applications Conference, pages 201–205, 2009

work page 2009
[30]

An ontology-based approach for data cleaning

Paulo Oliveira, Maria de Fatima Rodrigues, and Pedro Rangel Henriques. An ontology-based approach for data cleaning. InICIQ, pages 307–320, 2006

work page 2006
[31]

Quality views: capturing and exploiting the user perspective on data quality

Paolo Missier, Suzanne Embury, Mark Greenwood, Alun Preece, and Binling Jin. Quality views: capturing and exploiting the user perspective on data quality. InProceedings of the International Conference on V ery Large Data Bases, volume 32, page 977, 2006

work page 2006
[32]

Using ontologies providing domain knowledge for data quality management

Stefan Brüggemann and Fabian Grüning. Using ontologies providing domain knowledge for data quality management. InNetworked Knowledge - Networked Media - Integrating Knowledge Management, pages 187–203. Springer, 2009

work page 2009
[33]

An efficient method of data quality using quality evaluation ontology.2008 Third International Conference on Convergence and Hybrid Information Technology, 2:1058–1061, 2008

O-Hoon Choi, Jun-Eun Lim, Hong-Seok Na, and Doo-Kwon Baik. An efficient method of data quality using quality evaluation ontology.2008 Third International Conference on Convergence and Hybrid Information Technology, 2:1058–1061, 2008

work page 2008
[34]

Building semantic mappings from databases to ontologies

Yuan An, John Mylopoulos, and Alexander Borgida. Building semantic mappings from databases to ontologies. InAAAI, pages 1557–1566, 2006

work page 2006
[35]

Using ontologies for xml data cleaning

Diego Milano, Monica Scannapieco, and Tiziana Catarci. Using ontologies for xml data cleaning. InOn the Move to Meaningful Internet Systems 2005: OTM 2005 Workshops: OTM Confederated Internationl Workshops and Posters, A WeSOMe, CAMS, GADA, MIOS+ INTEROP , ORM, PhDS, SeBGIS, SWWS, and WOSE 2005, Agia Napa, Cyprus, October 31-November 4, 2005. Proceedings,...

work page 2005
[36]

Automated data quality monitoring

Lisa Ehrlinger and Wolfram Wöß. Automated data quality monitoring. InProceedings of the 22nd MIT International Conference on Information Quality (ICIQ 2017), pages 15–1, 2017

work page 2017
[37]

Data quality ontology: an ontology for imperfect knowledge

Andrew U Frank. Data quality ontology: an ontology for imperfect knowledge. InCOSIT, pages 406–420, 2007

work page 2007
[38]

Towards an ontology for data quality in integrated chronic disease management: a realist review of the literature.International journal of medical informatics, 82(1):10–24, 2013

Siaw-Teng Liaw, Alireza Rahimi, Pradeep Ray, Jane Taggart, Sarah Dennis, Simon de Lusignan, B Jalaludin, AET Yeo, and Amir Talaei-Khoei. Towards an ontology for data quality in integrated chronic disease management: a realist review of the literature.International journal of medical informatics, 82(1):10–24, 2013

work page 2013
[39]

A data quality ontology for the secondary use of ehr data

Steven G Johnson, Stuart Speedie, Gyorgy Simon, Vipin Kumar, and Bonnie L Westra. A data quality ontology for the secondary use of ehr data. InAMIA Annual Symposium Proceedings, volume 2015, page 1937. American Medical Informatics Association, 2015

work page 2015
[40]

An ontology to assess data quality domains

Angelica Urrutia, Emma Chavez, Regina Motz, and Rosa Gajardo. An ontology to assess data quality domains. a case study applied to a health care entity.IEEE Latin America Transactions, 15(8):1506–1512, 2017

work page 2017
[41]

Ontology-based data quality framework for data stream applications

Sandra Geisler, Sven Weber, and Christoph Quix. Ontology-based data quality framework for data stream applications. InICIQ, 2011. 16 APREPRINT- FEBRUARY, 26 2024

work page 2011
[42]

What is a knowledge graph?, (accessed May 15, 2023)

Stanford University. What is a knowledge graph?, (accessed May 15, 2023). https://web.stanford.edu/ ~vinayc/kg/notes/What_is_a_Knowledge_Graph.html

work page 2023
[43]

Hadi Fadlallah, Yehia Taher, and Ali H. Jaber. Raden: A scalable and efficient radiation data engineering. In BDCSIntell, 2018

work page 2018
[44]

Hadi Fadlallah, Yehia Taher, Rafiqul Haque, and Ali H. Jaber. Oradiex: A big data driven smart framework for real-time surveillance and analysis of individual exposure to radioactive pollution. InBDCSIntell, 2019

work page 2019
[45]

25012:2008 software engineering — software product quality requirements and evaluation (square) — data quality model

ISO/IEC. 25012:2008 software engineering — software product quality requirements and evaluation (square) — data quality model. Standard, ISO/IEC, 2008

[46] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

