A Hybrid Data Cleaning Framework using Markov Logic Networks

Bin Yao; Congcong Ge; Haobo Wang; Qing Li; Xiaoye Miao; Yunjun Gao

arxiv: 1903.05826 · v1 · pith:QJUA46OXnew · submitted 2019-03-14 · 💻 cs.DB


Yunjun Gao, Congcong Ge, Xiaoye Miao, Haobo Wang, Bin Yao, Qing Li

classification 💻 cs.DB

keywords data · cleaning · mlnclean · concepts · framework · hybrid · logic · markov



As dirty data grows more prevalent, data cleaning has become a crux of data analysis. Most existing algorithms rely on either qualitative techniques (e.g., data rules) or quantitative ones (e.g., statistical methods). In this paper, we present a novel hybrid data cleaning framework built on Markov logic networks (MLNs), termed MLNClean, which is capable of cleaning both schema-level and instance-level errors. MLNClean consists of two cleaning stages: it first cleans multiple data versions separately (each of which corresponds to one data rule), and then derives the final clean data from those versions. Moreover, we propose a series of techniques and concepts, e.g., the MLN index and the concepts of reliability score and fusion score, to facilitate the cleaning process. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of MLNClean over the state-of-the-art approach in terms of both accuracy and efficiency.
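The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not define the MLN index, the reliability score, or the fusion score, so the sketch below substitutes plain majority votes for all of them. The rule format (functional dependencies written as `(lhs, rhs)` pairs) and every name in the code (`RULES`, `clean_version`, `fuse`, `two_stage_clean`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy rules: each pair (lhs, rhs) reads "lhs determines rhs".
RULES = [("zip", "city"), ("city", "state")]


def clean_version(records, lhs, rhs):
    """Stage 1 stand-in: repair one copy of the data under a single rule.

    Records sharing the same lhs value get the most frequent rhs value.
    (Assumption: a majority vote replaces the paper's reliability score.)
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[lhs]].append(rec[rhs])
    # On a tie, the value seen first in the group wins.
    majority = {key: Counter(vals).most_common(1)[0][0]
                for key, vals in groups.items()}
    return [{**rec, rhs: majority[rec[lhs]]} for rec in records]


def fuse(original, versions):
    """Stage 2 stand-in: merge the per-rule versions cell by cell.

    A cell keeps its original value unless at least one version repaired
    it; then the most common repair is taken.
    (Assumption: this vote replaces the paper's fusion score.)
    """
    fused = []
    for i, orig in enumerate(original):
        row = {}
        for attr, old in orig.items():
            repairs = [v[i][attr] for v in versions if v[i][attr] != old]
            row[attr] = Counter(repairs).most_common(1)[0][0] if repairs else old
        fused.append(row)
    return fused


def two_stage_clean(records, rules=RULES):
    # One cleaned data version per rule, then a single fused result.
    versions = [clean_version(records, lhs, rhs) for lhs, rhs in rules]
    return fuse(records, versions)
```

For example, given three records with zip `10001` where one city is misspelled as `Nwe York`, the version built from the `zip` to `city` rule repairs the misspelling, and the fusion step carries that repair into the final output while leaving untouched cells as they were.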



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Collaborative Large and Small Language Models for Accurate and Scalable Data Repair
cs.DB 2026-06 unverdicted novelty 6.0

LasRepair++ pairs an LLM instructor with an SLM corrector, refines context via EM, and down-weights uncertain repairs using column-calibrated confidence, reporting 18.1% average F1 gain over baselines on data repair tasks.