Database Querying under Missing Values Governed by Missingness Mechanisms
Pith reviewed 2026-05-10 17:48 UTC · model grok-4.3
The pith
Missing values governed by a Bayesian network missingness graph allow construction of a block-independent probabilistic database for query answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing the missingness mechanism as a Missingness Graph (MG), which is a Bayesian network, the observed database can be used to construct a block-independent probabilistic database. This construction supports two specific query answering techniques that jointly account for probabilistic uncertainty and the statistical plausibility of implicit imputations of missing values, along with complexity characterizations of their computational feasibility.
What carries the argument
The Missingness Graph (MG), a Bayesian network that models dependencies among database attributes regarding why values are missing, which enables transforming the observed database into a block-independent probabilistic database suitable for the proposed query answering methods.
If this is right
- The MG and observed DB allow building a block-independent probabilistic DB.
- Two QA techniques capture both probabilistic uncertainty and statistical plausibility of MV imputation.
- Complexity results characterize the feasibility of the QA approaches.
- The approach departs from traditional NULL-based treatments of missing data in RDBs.
Where Pith is reading between the lines
- Integrating domain-specific knowledge about missingness causes could enhance query accuracy in specialized databases.
- This method might be extended by learning the Missingness Graph automatically from data patterns.
- Block-independence could allow efficient query optimization techniques in larger systems.
Load-bearing premise
The missingness mechanism can be accurately modeled as a Bayesian network involving the database attributes, and the observed data together with this model is sufficient to build a block-independent probabilistic database for query answering.
What would settle it
Finding a dataset where the missingness follows a known mechanism but queries answered via the constructed probabilistic DB give different results from the true complete data distribution, or where the Bayesian network fails to represent the actual dependencies in missingness.
Figures
read the original abstract
We address the problems of giving a semantics to- and doing query answering (QA) on a relational database (RDB) that has missing values (MVs). The causes for the latter are governed by a Missingness Mechanism that is modelled as a Bayesian Network, which represents a Missingness Graph (MG) and involves the DB attributes. Our approach considerable departs from the treatment of RDBs with NULL (values). The MG together with the observed DB allow to build a block-independent probabilistic DB, on which basis we propose two QA techniques that jointly capture probabilistic uncertainty and statistical plausibility of the implicit imputation of MVs. We obtain complexity results that characterize the computational feasibility of those approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that missing values in relational databases can be handled by modeling their missingness mechanism as a Bayesian Network (the Missingness Graph or MG) over database attributes. Combined with the observed database, this MG yields a block-independent probabilistic database via factorization of the joint distribution over possible worlds. Two query-answering techniques are then defined on this structure that jointly capture probabilistic uncertainty and the statistical plausibility of implicit imputations; complexity results (PTIME vs. #P-hard) follow from known results on block-independent PDBs.
Significance. If the central construction holds, the work supplies a principled, mechanism-aware semantics for missing data that departs from NULL-based treatments and directly reduces query answering to standard probabilistic evaluation on block-independent PDBs. The resulting complexity characterizations are immediately useful for assessing feasibility, and the explicit use of the missingness graph as a generative model offers a clear route to incorporating domain knowledge about why values are missing.
minor comments (2)
- [Abstract] Abstract: the phrase 'Our approach considerable departs' is grammatically incorrect and should read 'Our approach considerably departs'.
- [Section 3 (construction)] The factorization of the joint distribution over possible worlds (used to establish block independence) is central; a short illustrative example in the main text would help readers verify that the MG induces the claimed independence structure without additional assumptions.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the core contribution of modeling missingness mechanisms via a Bayesian network (Missingness Graph) to obtain a block-independent probabilistic database for query answering. No specific major comments or requested changes were provided in the report.
Circularity Check
No significant circularity
full rationale
The paper models the missingness mechanism explicitly as a Bayesian Network (Missingness Graph) over DB attributes, then defines the block-independent probabilistic DB via the factorization of the joint distribution over possible worlds consistent with the observed data and the MG. The two QA techniques are reductions to standard probabilistic query evaluation on this structure, and the PTIME/#P-hard complexity results follow directly from known results on block-independent PDBs once the construction is granted. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the central steps are formally defined and self-contained against external probabilistic database benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Missing values in the database are governed by a missingness mechanism that can be modeled as a Bayesian Network represented by a Missingness Graph involving the database attributes.
invented entities (2)
-
Missingness Graph (MG)
no independent evidence
-
block-independent probabilistic DB
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Ahuja, R. K., Magnanti, T. L. and Orlin, J.B. \ Network Flows: Theory, Algorithms, and Applications . Prentice Hall, 1993
work page 1993
-
[2]
Ali, S. M. and Silvey, S. D. A General Class of Coefficients of Divergence of one Distribution from Another. Journal of the Royal Statistical Society, Series B , 1996, 28(1):131--142
work page 1996
-
[3]
Allison, P. D. \ Maximum Likelihood Estimation . \ \ Quantitative Applications in the Social Sciences, 96. Sage Publications, 1993
work page 1993
-
[4]
Allison, P. D. \ Missing Data . \ Quantitative Applications in the Social Sciences, 136. Sage Publications, 2002
work page 2002
- [5]
-
[6]
Bertossi, L. and Bravo, L. \ Consistency and Trust in Peer Data Exchange Systems. Theory and Practice of Logic Programming , 2017, 17(2):148-204
work page 2017
-
[7]
Bishop, C. M. \ Pattern Recognition and Machine Learning . \ Springer, 2006
work page 2006
-
[8]
\ A Fast Algorithm for Enumerating Bipartite Perfect Matchings., ISAAC 2001 (LNCS vol 2223), 2001
Takeaki, U. \ A Fast Algorithm for Enumerating Bipartite Perfect Matchings., ISAAC 2001 (LNCS vol 2223), 2001
work page 2001
-
[9]
Chen, L., Kyng, R., Liu, Y. P., Peng, R., Gutenberg, M. P., & Sachdeva, S. (2022). Minimum cost flows, MDPs, and \( _1\)-regression in nearly linear time for separable convex instances. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2022)
work page 2022
-
[10]
Console, M, Libkin, L. and Peterfreund, L. \ Querying Incomplete Numerical Data: Between Certain and Possible Answers. 2022, CoRR abs/2210.15395
-
[11]
Dalvi, N. and Suciu, D. \ Efficient Query Evaluation on Probabilistic Databases. In Proc VLDB 2004
work page 2004
-
[12]
Das Sarma, A., Benjelloun, O., Halevy, A. and Widom, J. Working Models for Uncertain Data. Proc ICDE 2006
work page 2006
-
[13]
Devore, J. L., Berk, K. N. and Carlton, M. A. \ Modern Mathematical Statistics with Applications . 3rd Ed., Springer, 2021
work page 2021
-
[14]
Darwiche, A. \ Bayesian Networks. Communications of the ACM , 2010, 53(12):80-90
work page 2010
-
[15]
\ Reasoning with Probabilistic and Deterministic Graphical Models
Dechter, R. \ Reasoning with Probabilistic and Deterministic Graphical Models . \ 2nd Ed., Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Pubs., 2019
work page 2019
-
[16]
De Sa, C., Ilyas, I., Kimelfeld, B., Re, C. and Rekatsinas, T. \ A Formal Framework for Probabilistic Unclean Databases. Proc. ICDT 2019, pp. 6:1-6:18
work page 2019
-
[17]
Dinic, E. A. Algorithm for Solution of a Problem of Maximum Flow in a Network with Power Estimation. Soviet Mathematics Doklady, 1970, 11:1277--1280
work page 1970
-
[18]
aGrUM/pyAgrum: a toolbox to build models and algorithms for Probabilistic Graphical Models in Python
Ducamp, G., Gonzales, C., Wuillemin, P.-H. aGrUM/pyAgrum: a toolbox to build models and algorithms for Probabilistic Graphical Models in Python. Proc. International Conference on Probabilistic Graphical Models, 2020, PMLR
work page 2020
-
[19]
\ Aggregation in Probabilistic Databases via Knowledge Compilation
Fink , R., Han, L., Olteanu, D. \ Aggregation in Probabilistic Databases via Knowledge Compilation. Proc. VLDB Endow. 5(5): 490-501 (2012)
work page 2012
-
[20]
Gelman, A. and Hill, J. \ Data Analysis Using Regression and Multilevel/Hierarchical Models . Cambridge Univ. Press, 2007
work page 2007
-
[21]
Gheerbrant, A., Libkin, L., Rogova, A. and Sirangelo, C. \ Querying Incomplete Data: Complexity and Tractability via Datalog and First-Order Rewritings. \ Theory and Practice of Logic Programming. \ Published online on 28 November 2023
work page 2023
-
[22]
Gilad, A., Imber, A. and Kimelfeld, B. \ The Consistency of Probabilistic Databases with Independent Cells. \ Proc. ICDT, 2023
work page 2023
-
[23]
Greco, S., Molinaro, C. and Spezzano, F. \ Incomplete Data and Data Dependencies in Relational Databases. Synthesis Lectures in Data Management, Morgan & Claypool Pubs., 2012
work page 2012
-
[24]
Gribkoff, E., Van den Broeck, G. and Suciu, D. \ The Most Probable Database Problem. \ Proc. BUDA 2014, collocated with SIGMOD, 2014
work page 2014
-
[25]
Fukuda, K. , Matsui, T. \ Finding all the perfect matchings in bipartite graphs. Applied Mathematics Letters . 1994
work page 1994
-
[26]
Hastie, T., Tibshirani, R. and Friedman, J. \ The Elements of Statistical Learning . 2nd Ed., Springer, 2017
work page 2017
-
[27]
Imielinski, T. and Lipski Jr., W. \ Incomplete Information in Relational Databases. \ Journal of the ACM , 1984, 31(4):761-791
work page 1984
-
[28]
Khosravi, P., Liang, Y., Choi, Y. and Van den Broeck, G. \ What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features. \ Proc. IJCAI, 2019, pp. 2716-2724
work page 2019
-
[29]
Kuhn, H. W. \ The Hungarian Method for the Assignment Problem. \ Naval Research Logistics Quarterly , 1955, 2(1-2):83-97
work page 1955
-
[30]
Kumar Jha, A. and Dan Suciu, D. \ Probabilistic Databases with MarkoViews. Proc. VLDB 5(11):1160-1171, 2012
work page 2012
-
[31]
Little, R. J. and Rubin, D. B. \ Statistical Analysis with Missing Data . \ John Wiley & Sons, 3rd Ed., 2019
work page 2019
-
[32]
Littman, M. L. and Goldsmith, J. and Mundhenk, M. \ The Computational Complexity of Probabilistic Planning , JAIR, 1998
work page 1998
-
[33]
MacKay, D. J. \ Information Theory, Inference, and Learning Algorithms . Cambridge University Press, 2003
work page 2003
-
[34]
Solving Integer Minimum Cost Flows with Separable Convex Cost Objective Polynomially
Minoux, M.. Solving Integer Minimum Cost Flows with Separable Convex Cost Objective Polynomially. In: Gallo, G., Sandi, C. (eds) `Netflow at Pisa'. Mathematical Programming Studies, Springer 1986, vol 26
work page 1986
-
[35]
Mohan, K., Pearl, J. and Tian, J. \ Graphical Models for Inference with Missing Data. \ Proc NIPS , 2013, pp. 1277-1285
work page 2013
-
[36]
On the Testability of Models with Missing Data
Mohan, K and Pearl, J. On the Testability of Models with Missing Data. \ Proc. AISTAT, 2014
work page 2014
-
[37]
\ Graphical Models for Inference with Missing Data
Mohan, K. \ Graphical Models for Inference with Missing Data. \ PhD Thesis, UCLA, 2017. https://escholarship.org/uc/item/6mk2b174
work page 2017
-
[38]
Mohan, K. and Pearl, J. \ Graphical Models for Processing Missing Data. \ Journal of the American Statistical Association , 2021, 116(534):1023–1037
work page 2021
-
[39]
Olteanu, D. and Z\' a vodn\' y , J. \ Size Bounds for Factorised Representations of Query Results. \ ACM TODS , 2015, 40(1)
work page 2015
-
[40]
\ Computational Complexity , Addison-Wesley, Reading, MA, 1994
Papadimitriou, C. \ Computational Complexity , Addison-Wesley, Reading, MA, 1994
work page 1994
-
[41]
\ Causality: Models, Reasoning and Inference
Pearl, J. \ Causality: Models, Reasoning and Inference . \ Cambridge Univ. Press, 2nd edition, 2009
work page 2009
-
[42]
\ The Foundations of Causal Inference
Pearl, J. \ The Foundations of Causal Inference. \ Sociological Methodology , 2010, 40:75–149. http://www.jstor.org/stable/41336883
-
[43]
R\' e , C., Dalvi, N. and Suciu, D. \ Efficient Top-k Query Evaluation on Probabilistic Data. \ Proc. ICDE, 2007, pp. 886-895
work page 2007
-
[44]
R\'e, C., Dalvi, N. and Naughton, J. F. \ Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Proc. CIDR, 2005, pp. 262–276
work page 2005
-
[45]
Reiter, R.\ A Sound and Sometimes Complete Query Evaluation Algorithm for Relational Databases with Null Values. \ J. of the ACM , 1986, 33(2):349-370
work page 1986
-
[46]
\ Factorised representations of query results: size bounds and readability.\ Proc ICDT 2012
Olteanu, Dan and Z\' a vodn\' y , Jakub. \ Factorised representations of query results: size bounds and readability.\ Proc ICDT 2012
work page 2012
-
[47]
\ The Trichotomy of HAVING Queries on a Probabilistic Database
R\' e , C., Suciu, D. \ The Trichotomy of HAVING Queries on a Probabilistic Database. \ VLDB J. , 2009, 18(5):1091-1116
work page 2009
-
[48]
Royden, H. L. Real Analysis , 3rd ed. Macmillan, New York, 1988
work page 1988
-
[49]
Salimi, B., Rodriguez, L., Howe, B. and Suciu, D. \ Interventional Fairness: Causal Database Repair for Algorithmic Fairness. \ Proc. SIGMOD, 2019, pp. 793–810
work page 2019
-
[50]
Salimi, B., Milani, M., Pirhadi, A., Cloninger, A. and Moslemi, M. H. \ OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport. \ Proc. SIGMOD, 2024
work page 2024
-
[51]
Senellart, P., Jachiet, L., Maniu, S. and Ramusat, Y. \ ProvSQL: Provenance and Probability Management in PostgreSQL. \ Proc VLDB, 2018
work page 2018
-
[52]
\ Top-k Query Processing in Uncertain Databases
Soliman, M., Ilyas, I and Chang, K. \ Top-k Query Processing in Uncertain Databases. In Proc. ICDE 2007
work page 2007
-
[53]
Soliman, M., Ilyas, I. and Chang, K. \ Probabilistic Top-k and Ranking-Aggregate Queries. ACM TODS , 2008, 33(3)
work page 2008
-
[54]
Strozecki, Y. \ Enumeration Complexity: Incremental Time, Delay and Space. arXiv, 2023., https://arxiv.org/abs/2309.17042
-
[55]
Suciu, D., Olteanu, D., R\'e, C. and Koch, C. \ Probabilistic Databases . \ Synthesis Lectures on Data Management, Morgan & Claypool Pubs., 2011
work page 2011
-
[56]
Complexity Classes Defined by Counting Quantifiers
Tor\' a n, J. Complexity Classes Defined by Counting Quantifiers. J. ACM. , 1991
work page 1991
-
[57]
Toussaint, E., Guagliardo, P., Libkin, L. and Sequeda, J. \ Troubles with Nulls, Views from the Users. Proc. VLDB Endow., 2022, 15(11):2613-2625
work page 2022
-
[58]
Van den Broeck, G., Mohan, K., and Choi, A., Darwiche, A. and Pearl, J. Efficient Algorithms for Bayesian Network Parameter Learning from Incomplete Data. Proc. UAI, 2013
work page 2013
-
[59]
Van den Broeck, G. and Suciu, D. \ Query Processing on Probabilistic Data: A Survey. \ Foundations and Trends in Databases , 2015, 7(3-4):197–341. \ NOW Publishers
work page 2015
-
[60]
Valiant, L. G. \ The Complexity of Computing the Permanent. Theoretical Computer Science , 8(2):189--201, 1979
work page 1979
- [61]
-
[62]
Végh, L. A. (2016). A strongly polynomial algorithm for a class of minimum-cost flow problems with separable convex objectives. SIAM Journal on Computing, 45(5), 1729--1761
work page 2016
-
[63]
Wong, S. K. M. and Wang, Z. W. \ On Axiomatization of Probabilistic Conditional Independence. \ Proc. Conf. Uncertainty in Artificial Intelligence, 1994, pp. 591–597
work page 1994
-
[64]
Wong, S. K. M., Butz, C. J. and Wu, D. \ On the Implication Problem for Probabilistic Conditional Independency. \ IEEE Transactions on Systems, Man, and Cybernetics, Part A , 2000, 30(6):785-805
work page 2000
-
[65]
https://stats.oarc.ucla.edu/stata/seminars/mi_in_stata_pt1_new
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.