pith. sign in

arxiv: 2604.06520 · v1 · submitted 2026-04-07 · 💻 cs.DB · cs.AI· cs.LO

Database Querying under Missing Values Governed by Missingness Mechanisms

Pith reviewed 2026-05-10 17:48 UTC · model grok-4.3

classification 💻 cs.DB cs.AIcs.LO
keywords database queryingmissing valuesmissingness mechanismsBayesian networksprobabilistic databasesquery answeringmissingness graph
0
0 comments X

The pith

Missing values governed by a Bayesian network missingness graph allow construction of a block-independent probabilistic database for query answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses query answering on relational databases with missing values whose absence is explained by a missingness mechanism modeled as a Bayesian network, or Missingness Graph, over the attributes. This model combined with the observed data enables building a block-independent probabilistic database. Two query answering techniques are introduced that incorporate both the probabilistic nature of the uncertainty and the statistical plausibility of imputing the missing values. Complexity results are provided to determine when these techniques are computationally feasible. This framework offers a semantics for incomplete databases that goes beyond simple NULL handling by explicitly using the causes of missingness.

Core claim

By representing the missingness mechanism as a Missingness Graph (MG), which is a Bayesian network, the observed database can be used to construct a block-independent probabilistic database. This construction supports two specific query answering techniques that jointly account for probabilistic uncertainty and the statistical plausibility of implicit imputations of missing values, along with complexity characterizations of their computational feasibility.

What carries the argument

The Missingness Graph (MG), a Bayesian network that models dependencies among database attributes regarding why values are missing, which enables transforming the observed database into a block-independent probabilistic database suitable for the proposed query answering methods.

If this is right

  • The MG and observed DB allow building a block-independent probabilistic DB.
  • Two QA techniques capture both probabilistic uncertainty and statistical plausibility of MV imputation.
  • Complexity results characterize the feasibility of the QA approaches.
  • The approach departs from traditional NULL-based treatments of missing data in RDBs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating domain-specific knowledge about missingness causes could enhance query accuracy in specialized databases.
  • This method might be extended by learning the Missingness Graph automatically from data patterns.
  • Block-independence could allow efficient query optimization techniques in larger systems.

Load-bearing premise

The missingness mechanism can be accurately modeled as a Bayesian network involving the database attributes, and the observed data together with this model is sufficient to build a block-independent probabilistic database for query answering.

What would settle it

Finding a dataset where the missingness follows a known mechanism but queries answered via the constructed probabilistic DB give different results from the true complete data distribution, or where the Bayesian network fails to represent the actual dependencies in missingness.

Figures

Figures reproduced from arXiv: 2604.06520 by Farouk Toumani, Leopoldo Bertossi, Maxime Buron.

Figure 1
Figure 1. Figure 1: Missingness Graph. Example 2. (Ex. 1 cont.) For simplicity, assume now that attribute A is fully observed and denoted with Ao . The MG in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Three Missingness Graphs. The MG in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) Missingness graph. (b) Re￾sulting BID. For this MG, the probabilities are calculated as in Example 6, obtaining, for example, for the tuples in the first block of Rp : p(τ 1 1 ) = PM(Mm 1 = 1|A = a, B = b) × PM(A = a, B = b), p(τ 1 2 ) = PM(Mm 1 = 0|A = a, B = b) × PM(A = a, B = b). Accordingly, we define the probabilities in the MG in such a way that p(τ 1 1 ) = p1, p(τ 1 2 ) = (1 − p1), etc. Now, con… view at source ↗
Figure 4
Figure 4. Figure 4: A arbitrary BID B To construct the proof, we start with a BID D over a schema S = {T} and, in polynomial time, build an incomplete database D∗ , an MG M, and a join distribution P M. From this construction, we derive a BID instance Dp ≡ D. The database D∗ is defined over the schema {T ∗ , A∗ 1 , . . . , A∗ n}. The attribute T has domain dom(T) = {t1, . . . , tn}, while each attribute Ai (for i ∈ {1, . . . … view at source ↗
Figure 7
Figure 7. Figure 7: The incomplete database D∗ tuple_id Tm Am 1 . . . Am n A∗ 1 I A1 . . . A∗ n I An T∗ I T Prob J 1 B1 t1 in(t1, B1) . . . in(t1, Bn) 1 0 . . . na 1 na 1 p 1 1 n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J n B1 tn in(tn, B1) . . . in(tn, Bn) 1 0 . . . na 1 na 1 p n 1 n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J n Bn tn in(tn, Bn) . . . i… view at source ↗
Figure 8
Figure 8. Figure 8: The MG M The join distribution PM, illustrated in [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
read the original abstract

We address the problems of giving a semantics to- and doing query answering (QA) on a relational database (RDB) that has missing values (MVs). The causes for the latter are governed by a Missingness Mechanism that is modelled as a Bayesian Network, which represents a Missingness Graph (MG) and involves the DB attributes. Our approach considerable departs from the treatment of RDBs with NULL (values). The MG together with the observed DB allow to build a block-independent probabilistic DB, on which basis we propose two QA techniques that jointly capture probabilistic uncertainty and statistical plausibility of the implicit imputation of MVs. We obtain complexity results that characterize the computational feasibility of those approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that missing values in relational databases can be handled by modeling their missingness mechanism as a Bayesian Network (the Missingness Graph or MG) over database attributes. Combined with the observed database, this MG yields a block-independent probabilistic database via factorization of the joint distribution over possible worlds. Two query-answering techniques are then defined on this structure that jointly capture probabilistic uncertainty and the statistical plausibility of implicit imputations; complexity results (PTIME vs. #P-hard) follow from known results on block-independent PDBs.

Significance. If the central construction holds, the work supplies a principled, mechanism-aware semantics for missing data that departs from NULL-based treatments and directly reduces query answering to standard probabilistic evaluation on block-independent PDBs. The resulting complexity characterizations are immediately useful for assessing feasibility, and the explicit use of the missingness graph as a generative model offers a clear route to incorporating domain knowledge about why values are missing.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'Our approach considerable departs' is grammatically incorrect and should read 'Our approach considerably departs'.
  2. [Section 3 (construction)] The factorization of the joint distribution over possible worlds (used to establish block independence) is central; a short illustrative example in the main text would help readers verify that the MG induces the claimed independence structure without additional assumptions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the core contribution of modeling missingness mechanisms via a Bayesian network (Missingness Graph) to obtain a block-independent probabilistic database for query answering. No specific major comments or requested changes were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper models the missingness mechanism explicitly as a Bayesian Network (Missingness Graph) over DB attributes, then defines the block-independent probabilistic DB via the factorization of the joint distribution over possible worlds consistent with the observed data and the MG. The two QA techniques are reductions to standard probabilistic query evaluation on this structure, and the PTIME/#P-hard complexity results follow directly from known results on block-independent PDBs once the construction is granted. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the central steps are formally defined and self-contained against external probabilistic database benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that missingness mechanisms are accurately representable as Bayesian networks and that this representation plus observed data yields a usable probabilistic database. No free parameters are mentioned. Two new modeling entities are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption Missing values in the database are governed by a missingness mechanism that can be modeled as a Bayesian Network represented by a Missingness Graph involving the database attributes.
    This is the foundational modeling choice stated in the abstract as the basis for building the probabilistic DB.
invented entities (2)
  • Missingness Graph (MG) no independent evidence
    purpose: To represent the missingness mechanism as a Bayesian network for the database attributes.
    New construct introduced to model causes of missing values.
  • block-independent probabilistic DB no independent evidence
    purpose: To serve as the basis for query answering that captures uncertainty and plausibility of imputations.
    Constructed from the MG and observed database for the QA techniques.

pith-pipeline@v0.9.0 · 5414 in / 1527 out tokens · 53007 ms · 2026-05-10T17:48:45.807341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages

  1. [1]

    K., Magnanti, T

    Ahuja, R. K., Magnanti, T. L. and Orlin, J.B. \ Network Flows: Theory, Algorithms, and Applications . Prentice Hall, 1993

  2. [2]

    Ali, S. M. and Silvey, S. D. A General Class of Coefficients of Divergence of one Distribution from Another. Journal of the Royal Statistical Society, Series B , 1996, 28(1):131--142

  3. [3]

    Allison, P. D. \ Maximum Likelihood Estimation . \ \ Quantitative Applications in the Social Sciences, 96. Sage Publications, 1993

  4. [4]

    Allison, P. D. \ Missing Data . \ Quantitative Applications in the Social Sciences, 136. Sage Publications, 2002

  5. [5]

    and Li, L

    Bertossi, L. and Li, L. \ Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates. \ IEEE Transactions on Knowledge and Data Engineering , 2013, 25(5):987-1000

  6. [6]

    and Bravo, L

    Bertossi, L. and Bravo, L. \ Consistency and Trust in Peer Data Exchange Systems. Theory and Practice of Logic Programming , 2017, 17(2):148-204

  7. [7]

    Bishop, C. M. \ Pattern Recognition and Machine Learning . \ Springer, 2006

  8. [8]

    \ A Fast Algorithm for Enumerating Bipartite Perfect Matchings., ISAAC 2001 (LNCS vol 2223), 2001

    Takeaki, U. \ A Fast Algorithm for Enumerating Bipartite Perfect Matchings., ISAAC 2001 (LNCS vol 2223), 2001

  9. [9]

    P., Peng, R., Gutenberg, M

    Chen, L., Kyng, R., Liu, Y. P., Peng, R., Gutenberg, M. P., & Sachdeva, S. (2022). Minimum cost flows, MDPs, and \( _1\)-regression in nearly linear time for separable convex instances. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2022)

  10. [10]

    and Peterfreund, L

    Console, M, Libkin, L. and Peterfreund, L. \ Querying Incomplete Numerical Data: Between Certain and Possible Answers. 2022, CoRR abs/2210.15395

  11. [11]

    and Suciu, D

    Dalvi, N. and Suciu, D. \ Efficient Query Evaluation on Probabilistic Databases. In Proc VLDB 2004

  12. [12]

    and Widom, J

    Das Sarma, A., Benjelloun, O., Halevy, A. and Widom, J. Working Models for Uncertain Data. Proc ICDE 2006

  13. [13]

    L., Berk, K

    Devore, J. L., Berk, K. N. and Carlton, M. A. \ Modern Mathematical Statistics with Applications . 3rd Ed., Springer, 2021

  14. [14]

    \ Bayesian Networks

    Darwiche, A. \ Bayesian Networks. Communications of the ACM , 2010, 53(12):80-90

  15. [15]

    \ Reasoning with Probabilistic and Deterministic Graphical Models

    Dechter, R. \ Reasoning with Probabilistic and Deterministic Graphical Models . \ 2nd Ed., Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Pubs., 2019

  16. [16]

    and Rekatsinas, T

    De Sa, C., Ilyas, I., Kimelfeld, B., Re, C. and Rekatsinas, T. \ A Formal Framework for Probabilistic Unclean Databases. Proc. ICDT 2019, pp. 6:1-6:18

  17. [17]

    Dinic, E. A. Algorithm for Solution of a Problem of Maximum Flow in a Network with Power Estimation. Soviet Mathematics Doklady, 1970, 11:1277--1280

  18. [18]

    aGrUM/pyAgrum: a toolbox to build models and algorithms for Probabilistic Graphical Models in Python

    Ducamp, G., Gonzales, C., Wuillemin, P.-H. aGrUM/pyAgrum: a toolbox to build models and algorithms for Probabilistic Graphical Models in Python. Proc. International Conference on Probabilistic Graphical Models, 2020, PMLR

  19. [19]

    \ Aggregation in Probabilistic Databases via Knowledge Compilation

    Fink , R., Han, L., Olteanu, D. \ Aggregation in Probabilistic Databases via Knowledge Compilation. Proc. VLDB Endow. 5(5): 490-501 (2012)

  20. [20]

    and Hill, J

    Gelman, A. and Hill, J. \ Data Analysis Using Regression and Multilevel/Hierarchical Models . Cambridge Univ. Press, 2007

  21. [21]

    and Sirangelo, C

    Gheerbrant, A., Libkin, L., Rogova, A. and Sirangelo, C. \ Querying Incomplete Data: Complexity and Tractability via Datalog and First-Order Rewritings. \ Theory and Practice of Logic Programming. \ Published online on 28 November 2023

  22. [22]

    and Kimelfeld, B

    Gilad, A., Imber, A. and Kimelfeld, B. \ The Consistency of Probabilistic Databases with Independent Cells. \ Proc. ICDT, 2023

  23. [23]

    and Spezzano, F

    Greco, S., Molinaro, C. and Spezzano, F. \ Incomplete Data and Data Dependencies in Relational Databases. Synthesis Lectures in Data Management, Morgan & Claypool Pubs., 2012

  24. [24]

    and Suciu, D

    Gribkoff, E., Van den Broeck, G. and Suciu, D. \ The Most Probable Database Problem. \ Proc. BUDA 2014, collocated with SIGMOD, 2014

  25. [25]

    , Matsui, T

    Fukuda, K. , Matsui, T. \ Finding all the perfect matchings in bipartite graphs. Applied Mathematics Letters . 1994

  26. [26]

    and Friedman, J

    Hastie, T., Tibshirani, R. and Friedman, J. \ The Elements of Statistical Learning . 2nd Ed., Springer, 2017

  27. [27]

    and Lipski Jr., W

    Imielinski, T. and Lipski Jr., W. \ Incomplete Information in Relational Databases. \ Journal of the ACM , 1984, 31(4):761-791

  28. [28]

    and Van den Broeck, G

    Khosravi, P., Liang, Y., Choi, Y. and Van den Broeck, G. \ What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features. \ Proc. IJCAI, 2019, pp. 2716-2724

  29. [29]

    Kuhn, H. W. \ The Hungarian Method for the Assignment Problem. \ Naval Research Logistics Quarterly , 1955, 2(1-2):83-97

  30. [30]

    and Dan Suciu, D

    Kumar Jha, A. and Dan Suciu, D. \ Probabilistic Databases with MarkoViews. Proc. VLDB 5(11):1160-1171, 2012

  31. [31]

    Little, R. J. and Rubin, D. B. \ Statistical Analysis with Missing Data . \ John Wiley & Sons, 3rd Ed., 2019

  32. [32]

    Littman, M. L. and Goldsmith, J. and Mundhenk, M. \ The Computational Complexity of Probabilistic Planning , JAIR, 1998

  33. [33]

    MacKay, D. J. \ Information Theory, Inference, and Learning Algorithms . Cambridge University Press, 2003

  34. [34]

    Solving Integer Minimum Cost Flows with Separable Convex Cost Objective Polynomially

    Minoux, M.. Solving Integer Minimum Cost Flows with Separable Convex Cost Objective Polynomially. In: Gallo, G., Sandi, C. (eds) `Netflow at Pisa'. Mathematical Programming Studies, Springer 1986, vol 26

  35. [35]

    and Tian, J

    Mohan, K., Pearl, J. and Tian, J. \ Graphical Models for Inference with Missing Data. \ Proc NIPS , 2013, pp. 1277-1285

  36. [36]

    On the Testability of Models with Missing Data

    Mohan, K and Pearl, J. On the Testability of Models with Missing Data. \ Proc. AISTAT, 2014

  37. [37]

    \ Graphical Models for Inference with Missing Data

    Mohan, K. \ Graphical Models for Inference with Missing Data. \ PhD Thesis, UCLA, 2017. https://escholarship.org/uc/item/6mk2b174

  38. [38]

    and Pearl, J

    Mohan, K. and Pearl, J. \ Graphical Models for Processing Missing Data. \ Journal of the American Statistical Association , 2021, 116(534):1023–1037

  39. [39]

    and Z\' a vodn\' y , J

    Olteanu, D. and Z\' a vodn\' y , J. \ Size Bounds for Factorised Representations of Query Results. \ ACM TODS , 2015, 40(1)

  40. [40]

    \ Computational Complexity , Addison-Wesley, Reading, MA, 1994

    Papadimitriou, C. \ Computational Complexity , Addison-Wesley, Reading, MA, 1994

  41. [41]

    \ Causality: Models, Reasoning and Inference

    Pearl, J. \ Causality: Models, Reasoning and Inference . \ Cambridge Univ. Press, 2nd edition, 2009

  42. [42]

    \ The Foundations of Causal Inference

    Pearl, J. \ The Foundations of Causal Inference. \ Sociological Methodology , 2010, 40:75–149. http://www.jstor.org/stable/41336883

  43. [43]

    and Suciu, D

    R\' e , C., Dalvi, N. and Suciu, D. \ Efficient Top-k Query Evaluation on Probabilistic Data. \ Proc. ICDE, 2007, pp. 886-895

  44. [44]

    and Naughton, J

    R\'e, C., Dalvi, N. and Naughton, J. F. \ Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Proc. CIDR, 2005, pp. 262–276

  45. [45]

    Reiter, R.\ A Sound and Sometimes Complete Query Evaluation Algorithm for Relational Databases with Null Values. \ J. of the ACM , 1986, 33(2):349-370

  46. [46]

    \ Factorised representations of query results: size bounds and readability.\ Proc ICDT 2012

    Olteanu, Dan and Z\' a vodn\' y , Jakub. \ Factorised representations of query results: size bounds and readability.\ Proc ICDT 2012

  47. [47]

    \ The Trichotomy of HAVING Queries on a Probabilistic Database

    R\' e , C., Suciu, D. \ The Trichotomy of HAVING Queries on a Probabilistic Database. \ VLDB J. , 2009, 18(5):1091-1116

  48. [48]

    Royden, H. L. Real Analysis , 3rd ed. Macmillan, New York, 1988

  49. [49]

    and Suciu, D

    Salimi, B., Rodriguez, L., Howe, B. and Suciu, D. \ Interventional Fairness: Causal Database Repair for Algorithmic Fairness. \ Proc. SIGMOD, 2019, pp. 793–810

  50. [50]

    and Moslemi, M

    Salimi, B., Milani, M., Pirhadi, A., Cloninger, A. and Moslemi, M. H. \ OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport. \ Proc. SIGMOD, 2024

  51. [51]

    and Ramusat, Y

    Senellart, P., Jachiet, L., Maniu, S. and Ramusat, Y. \ ProvSQL: Provenance and Probability Management in PostgreSQL. \ Proc VLDB, 2018

  52. [52]

    \ Top-k Query Processing in Uncertain Databases

    Soliman, M., Ilyas, I and Chang, K. \ Top-k Query Processing in Uncertain Databases. In Proc. ICDE 2007

  53. [53]

    and Chang, K

    Soliman, M., Ilyas, I. and Chang, K. \ Probabilistic Top-k and Ranking-Aggregate Queries. ACM TODS , 2008, 33(3)

  54. [54]

    Enumeration complexity: Incremental time, delay and space (habilitation thesis).arXiv preprint arXiv:2309.17042,

    Strozecki, Y. \ Enumeration Complexity: Incremental Time, Delay and Space. arXiv, 2023., https://arxiv.org/abs/2309.17042

  55. [55]

    and Koch, C

    Suciu, D., Olteanu, D., R\'e, C. and Koch, C. \ Probabilistic Databases . \ Synthesis Lectures on Data Management, Morgan & Claypool Pubs., 2011

  56. [56]

    Complexity Classes Defined by Counting Quantifiers

    Tor\' a n, J. Complexity Classes Defined by Counting Quantifiers. J. ACM. , 1991

  57. [57]

    and Sequeda, J

    Toussaint, E., Guagliardo, P., Libkin, L. and Sequeda, J. \ Troubles with Nulls, Views from the Users. Proc. VLDB Endow., 2022, 15(11):2613-2625

  58. [58]

    and Pearl, J

    Van den Broeck, G., Mohan, K., and Choi, A., Darwiche, A. and Pearl, J. Efficient Algorithms for Bayesian Network Parameter Learning from Incomplete Data. Proc. UAI, 2013

  59. [59]

    and Suciu, D

    Van den Broeck, G. and Suciu, D. \ Query Processing on Probabilistic Data: A Survey. \ Foundations and Trends in Databases , 2015, 7(3-4):197–341. \ NOW Publishers

  60. [60]

    Valiant, L. G. \ The Complexity of Computing the Permanent. Theoretical Computer Science , 8(2):189--201, 1979

  61. [61]

    \ All of Statistics

    Wasserman, L. \ All of Statistics . \ Springer, 2010

  62. [62]

    Végh, L. A. (2016). A strongly polynomial algorithm for a class of minimum-cost flow problems with separable convex objectives. SIAM Journal on Computing, 45(5), 1729--1761

  63. [63]

    Wong, S. K. M. and Wang, Z. W. \ On Axiomatization of Probabilistic Conditional Independence. \ Proc. Conf. Uncertainty in Artificial Intelligence, 1994, pp. 591–597

  64. [64]

    Wong, S. K. M., Butz, C. J. and Wu, D. \ On the Implication Problem for Probabilistic Conditional Independency. \ IEEE Transactions on Systems, Man, and Cybernetics, Part A , 2000, 30(6):785-805

  65. [65]

    https://stats.oarc.ucla.edu/stata/seminars/mi_in_stata_pt1_new