pith. sign in

arxiv: 1907.06295 · v1 · pith:6SUUJKTFnew · submitted 2019-07-14 · 💻 cs.DB

An Approach Based on Bayesian Networks for Query Selectivity Estimation

Pith reviewed 2026-05-24 21:19 UTC · model grok-4.3

classification 💻 cs.DB
keywords query selectivity estimationChow-Liu treesBayesian networksattribute correlationsdatabase cost modelsTPC-DS benchmarkquery optimizationindependence assumption
0
0 comments X

The pith

Chow-Liu trees capture attribute correlations to produce selectivity estimates an order of magnitude more accurate than the independence assumption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that building a Chow-Liu tree for each relation lets a cost model approximate the joint distribution of attribute values instead of treating them as independent. This change reduces the errors that arise when the optimizer is given badly wrong row-count predictions. On the TPC-DS benchmark the resulting estimates are roughly ten times more precise while the extra time and space cost stays modest. A reader would care because these estimates directly determine which join order, index, or algorithm the database chooses, so even modest gains in accuracy can improve end-to-end query speed. The work therefore relaxes one of the oldest simplifying assumptions in query optimization without destroying the practicality of the cost model.

Core claim

We propose a novel approach based on a particular type of Bayesian networks called Chow-Liu trees to approximate the distribution of attribute values inside each relation of a database. Our results on the TPC-DS benchmark show that our method is an order of magnitude more precise than other approaches whilst remaining reasonably efficient in terms of time and space.

What carries the argument

Chow-Liu trees, tree-structured Bayesian networks that encode the strongest pairwise dependencies between attributes and thereby approximate their joint distribution for selectivity calculation.

If this is right

  • Query optimizers receive row-count estimates whose errors are an order of magnitude smaller, so they more often select efficient join orders and access methods.
  • The space and time overhead of maintaining one Chow-Liu tree per relation stays low enough for practical deployment inside existing cost models.
  • The independence assumption can be dropped for any relation whose attributes exhibit tree-like pairwise correlations without redesigning the rest of the optimizer.
  • TPC-DS workloads, which contain realistic correlation patterns, serve as a concrete demonstration that the precision gain is achievable on standard decision-support data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trees could be refreshed incrementally when data changes, turning the method into a lightweight online statistics collector.
  • Because the trees are small, they could be shipped with data exports or used inside federated query engines that lack global statistics.
  • If higher-order dependencies prove important on some workloads, the Chow-Liu construction could be replaced by a slightly denser graphical model while preserving the same interface to the cost model.

Load-bearing premise

That Chow-Liu trees built from sampled data will capture the correlations that actually matter for selectivity estimation across the full range of queries without introducing new sources of error that offset the gains.

What would settle it

Measure selectivity error on a new benchmark containing known higher-order attribute dependencies; if the error is no longer reduced by roughly a factor of ten relative to the independence baseline, the central claim fails.

Figures

Figures reproduced from arXiv: 1907.06295 by Frank Morvan, Max Halford, Philippe Saint-Pierre.

Figure 1
Figure 1. Figure 1: Possible factorisations of P(hair, nationality, gender) The goal of structure learning is to find a BN that closely matches P(X) whilst preserving a low computational complexity. Indeed, for any given BN, the cost of storing it and of computing a marginal distribution P(Xi) depend on it’s structure. The classic approach to structure learning is to define a scoring function that determines how good a BN is … view at source ↗
Figure 2
Figure 2. Figure 2: Steiner tree in blue containing nodes G, N, and H nee [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Construction time 4.3 Cardinality estimates We then compared methods based on their average accuracy. In other words we ran each method against each query and measured the average error. Although our method improves the accuracy of cardinality estimation for queries on a single relation; our goal in this benchmark is to measure how much this will impact the overall accuracy for general queries over multipl… view at source ↗
Figure 4
Figure 4. Figure 4: Average errors Unsurprisingly, the textbook method produces estimates that are off by several orders of magnitude. What is more interesting is that Bayesian networks are significantly better than sampling. The reason this occurs is because the sampling method doesn’t place any uncertainty as to if a value is present in a relation or not. A value is either in a 8 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cardinality estimation time As can be seen the main pitfall of the sampling method is that it takes a considerable amount of time to estimate a car￾dinality. This is expected because for each set of predicates a full pass has to be made on the according sample. Whats more we didn’t even take into account the fact that the necessary samples have to be loaded in memory beforehand. As for the textbook method,… view at source ↗
read the original abstract

The efficiency of a query execution plan depends on the accuracy of the selectivity estimates given to the query optimiser by the cost model. The cost model makes simplifying assumptions in order to produce said estimates in a timely manner. These assumptions lead to selectivity estimation errors that have dramatic effects on the quality of the resulting query execution plans. A convenient assumption that is ubiquitous among current cost models is to assume that attributes are independent with each other. However, it ignores potential correlations which can have a huge negative impact on the accuracy of the cost model. In this paper we attempt to relax the attribute value independence assumption without unreasonably deteriorating the accuracy of the cost model. We propose a novel approach based on a particular type of Bayesian networks called Chow-Liu trees to approximate the distribution of attribute values inside each relation of a database. Our results on the TPC-DS benchmark show that our method is an order of magnitude more precise than other approaches whilst remaining reasonably efficient in terms of time and space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes using Chow-Liu trees (a restricted form of Bayesian network) to approximate the joint distribution of attribute values within each database relation. This relaxes the ubiquitous attribute-independence assumption in query cost models while aiming to keep estimation time and space costs modest. The central empirical claim is that the resulting selectivity estimates are an order of magnitude more precise than competing methods on the TPC-DS benchmark.

Significance. If the reported precision gains prove robust under controlled measurement and sensitivity checks, the work would supply a concrete, tree-structured mechanism for incorporating limited attribute correlations into selectivity estimation without incurring the cost of full joint distributions or high-dimensional histograms. Such a technique could be directly integrated into existing DBMS cost models.

major comments (2)
  1. [Abstract] Abstract: the statement that results show the method is 'an order of magnitude more precise than other approaches' supplies no information on the error metric employed (q-error, relative error, etc.), the concrete baseline implementations, the query workload size or mix, or any statistical measures such as standard deviation or confidence intervals. This directly undercuts the central empirical claim.
  2. [Evaluation] Evaluation (results section): no sensitivity analysis is presented on sample size used to build the Chow-Liu trees or on the fidelity of the learned tree structure to the predicate correlations actually appearing in TPC-DS queries. Because the approach relies on finite samples to recover both structure and parameters, any mismatch between the empirical distribution and the true distribution is load-bearing for the claimed accuracy improvement.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'reasonably efficient in terms of time and space' is not accompanied by any concrete figures or comparison tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and empirical support.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the statement that results show the method is 'an order of magnitude more precise than other approaches' supplies no information on the error metric employed (q-error, relative error, etc.), the concrete baseline implementations, the query workload size or mix, or any statistical measures such as standard deviation or confidence intervals. This directly undercuts the central empirical claim.

    Authors: We agree that the abstract would benefit from additional context to substantiate the central claim. In the revised version we will expand the abstract to specify that precision is quantified via q-error, that the baselines comprise attribute independence as well as the other methods evaluated in the paper, that the workload consists of the full set of TPC-DS query templates, and that results include mean q-error together with standard deviation across repeated trials. These additions will be kept concise while supplying the requested information. revision: yes

  2. Referee: [Evaluation] Evaluation (results section): no sensitivity analysis is presented on sample size used to build the Chow-Liu trees or on the fidelity of the learned tree structure to the predicate correlations actually appearing in TPC-DS queries. Because the approach relies on finite samples to recover both structure and parameters, any mismatch between the empirical distribution and the true distribution is load-bearing for the claimed accuracy improvement.

    Authors: The observation is correct: the current evaluation fixes a single sample size and does not report sensitivity or structure validation against TPC-DS predicates. We will insert a new subsection that varies sample size (0.1 %, 1 %, 5 %, 10 % of each relation) and plots the resulting q-error, and we will add a brief analysis of the learned tree edges versus the most frequent selection and join predicates appearing in the benchmark queries. This directly addresses the robustness concern while remaining within reasonable space limits. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation on external benchmark; no derivation reduces to its inputs by construction

full rationale

The paper presents a method that builds Chow-Liu trees from sampled data to approximate joint distributions and then measures selectivity estimation error on the TPC-DS benchmark. The central claim (order-of-magnitude improvement) is an empirical comparison against ground-truth selectivities computed from the benchmark data, not a quantity obtained by fitting parameters inside the same experiment and then renaming the fit as a prediction. No equations are shown that define the reported precision in terms of the model's own parameters, no self-citation is invoked as a uniqueness theorem, and the approach does not rename a known result under new coordinates. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly relies on standard assumptions of Bayesian network structure learning and the representativeness of TPC-DS data.

pith-pipeline@v0.9.0 · 5694 in / 1101 out tokens · 30352 ms · 2026-05-24T21:19:54.866637+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 2 internal anchors

  1. [1]

    Ac- cess path selection in a relational database management sys tem

    P Griffiths Selinger, Morton M Astrahan, Donald D Chamber lin, Raymond A Lorie, and Thomas G Price. Ac- cess path selection in a relational database management sys tem. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data , pages 23–34. ACM, 1979

  2. [2]

    On the propagation of errors in the size of join results , vol- ume 20

    Y annis E Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of join results , vol- ume 20. ACM, 1991

  3. [3]

    Looking Back at Postgres

    Joseph M Hellerstein. Looking back at postgres. arXiv preprint arXiv:1901.01973, 2019

  4. [4]

    Presto: Interacting with petabytes of data at facebook

    Martin Traverso. Presto: Interacting with petabytes of data at facebook. Retrieved February, 4:2014, 2013

  5. [5]

    Spark sql: Rel ational data processing in spark

    Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Da vies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark sql: Rel ational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015

  6. [6]

    Improved histograms for selectivity estimation of range predicates

    Viswanath Poosala, Peter J Haas, Y annis E Ioannidis, and Eugene J Shekita. Improved histograms for selectivity estimation of range predicates. In ACM Sigmod Record, volume 25, pages 294–305. ACM, 1996

  7. [7]

    Self-tunin g, gpu-accelerated kernel density models for mul- tidimensional selectivity estimation

    Max Heimel, Martin Kiefer, and V olker Markl. Self-tunin g, gpu-accelerated kernel density models for mul- tidimensional selectivity estimation. In Proceedings of the 2015 ACM SIGMOD International Conferenc e on Management of Data, pages 1477–1492. ACM, 2015

  8. [8]

    Equi-depth multidim ensional histograms

    M Muralikrishna and David J DeWitt. Equi-depth multidim ensional histograms. In ACM SIGMOD Record , volume 17, pages 28–36. ACM, 1988

  9. [9]

    Accura te estimation of the number of tuples satisfying a condi- tion

    Gregory Piatetsky-Shapiro and Charles Connell. Accura te estimation of the number of tuples satisfying a condi- tion. ACM Sigmod Record, 14(2):256–276, 1984

  10. [10]

    Query size esti mation by adaptive sampling

    Richard J Lipton and Jeffrey F Naughton. Query size esti mation by adaptive sampling. In Proceedings of the ninth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 40–46. ACM, 1990

  11. [11]

    Join size estimation subject to filter conditions

    David V engerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment , 8(12):1530–1541, 2015

  12. [12]

    Cardinality estimation done right: Index-based join sampling

    Viktor Leis, Bernharde Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017

  13. [13]

    Two-level sampling for join size estim ation

    Y u Chen and Ke Yi. Two-level sampling for join size estim ation. In Proceedings of the 2017 ACM International Conference on Management of Data , pages 759–774. ACM, 2017

  14. [14]

    Learned Cardinalities: Estimating Correlated Joins with Deep Learning

    Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis , Peter Boncz, and Alfons Kemper. Learned cardinali- ties: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677, 2018

  15. [15]

    How good are query optimizers, really? Proceedings of the VLDB Endowment , 9(3):204–215, 2015

    Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Bo ncz, Alfons Kemper, and Thomas Neumann. How good are query optimizers, really? Proceedings of the VLDB Endowment , 9(3):204–215, 2015

  16. [16]

    Selec tivity estimation using probabilistic models

    Lise Getoor, Benjamin Taskar, and Daphne Koller. Selec tivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, pages 461–472. ACM, 2001

  17. [17]

    Lightweight graphical models for selectivity estima- tion without independence assumptions

    Kostas Tzoumas, Amol Deshpande, and Christian S Jensen . Lightweight graphical models for selectivity estima- tion without independence assumptions. Proceedings of the VLDB Endowment , 4(11):852–863, 2011. 10 A PREPRINT - J ULY 16, 2019

  18. [18]

    An introduction to Bayesian networks , volume 210

    Finn V Jensen. An introduction to Bayesian networks , volume 210. UCL press London, 1996

  19. [19]

    Neil Robertson and Paul D. Seymour. Graph minors. ii. al gorithmic aspects of tree-width. Journal of algorithms, 7(3):309–322, 1986

  20. [20]

    The optimization of queries in rela tional databases

    Robert Philip Kooi. The optimization of queries in rela tional databases. 1980

  21. [21]

    Optim al histograms for limiting worst-case error propagation in the size of join results

    Y annis E Ioannidis and Stavros Christodoulakis. Optim al histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems (TODS) , 18(4):709–748, 1993

  22. [22]

    St holes: a multidimensional workload-aware histogram

    Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. St holes: a multidimensional workload-aware histogram. In ACM SIGMOD Record, volume 30, pages 211–222. ACM, 2001

  23. [23]

    On rectangular partitionings in two dimensions: Algo- rithms, complexity and applications

    S Muthukrishnan, Viswanath Poosala, and Torsten Suel. On rectangular partitionings in two dimensions: Algo- rithms, complexity and applications. Database Theory—ICDT’99 , pages 236–256, 1999

  24. [24]

    The computational complexity of prob abilistic inference using bayesian belief networks

    Gregory F Cooper. The computational complexity of prob abilistic inference using bayesian belief networks. Artificial intelligence, 42(2-3):393–405, 1990

  25. [25]

    On random sampling over joins

    Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayy a. On random sampling over joins. In ACM SIGMOD Record, volume 28, pages 263–274. ACM, 1999

  26. [26]

    Random sampling from databases

    Frank Olken. Random sampling from databases . PhD thesis, University of California, Berkeley, 1993

  27. [27]

    Wander join: O nline aggregation via random walks

    Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join: O nline aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data , pages 615–629. ACM, 2016

  28. [28]

    Join synopses for approxi- mate query answering

    Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approxi- mate query answering. In ACM SIGMOD Record, volume 28, pages 275–286. ACM, 1999

  29. [29]

    Leo-db2’s learning optimizer

    Michael Stillger, Guy M Lohman, V olker Markl, and Mokht ar Kandil. Leo-db2’s learning optimizer. In VLDB, volume 1, pages 19–28, 2001

  30. [30]

    Adaptive selectivity estimation using query feedback , vol- ume 23

    Chungmin Melvin Chen and Nick Roussopoulos. Adaptive selectivity estimation using query feedback , vol- ume 23. ACM, 1994

  31. [31]

    Le arning bayesian networks: The combination of knowledge and statistical data

    David Heckerman, Dan Geiger, and David M Chickering. Le arning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995

  32. [32]

    Learning bayesian network structure using lp relaxations

    Tommi Jaakkola, David Sontag, Amir Globerson, and Mari na Meila. Learning bayesian network structure using lp relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 358–365, 2010

  33. [33]

    Approximating discrete probabilit y distributions with dependence trees

    C Chow and Cong Liu. Approximating discrete probabilit y distributions with dependence trees. IEEE transac- tions on Information Theory , 14(3):462–467, 1968

  34. [34]

    Probabilistic networks and expert systems: Exact computational methods for Bayesian n etworks

    Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian n etworks. Springer Science & Business Media, 2006

  35. [35]

    The Steiner tree problem, volume 53

    Frank K Hwang, Dana S Richards, and Pawel Winter. The Steiner tree problem, volume 53. Elsevier, 1992

  36. [36]

    Why you should run tpc-ds: a workload analysis

    Meikel Poess, Raghunath Othayoth Nambiar, and David Wa lrath. Why you should run tpc-ds: a workload analysis. In Proceedings of the 33rd international conference on V ery large data bases, pages 1138–1149. VLDB Endowment, 2007

  37. [37]

    Preventing bad plans by bounding the impact of cardinality estimation errors

    Guido Moerkotte, Thomas Neumann, and Gabriele Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment , 2(1):982–993, 2009

  38. [38]

    Query optimization through the looking glass, and what we found running the join order benchmark

    Viktor Leis, Bernhard Radke, Andrey Gubichev, Atanas M irchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal , pages 1–26, 2018. 11