An Approach Based on Bayesian Networks for Query Selectivity Estimation
Pith reviewed 2026-05-24 21:19 UTC · model grok-4.3
The pith
Chow-Liu trees capture attribute correlations to produce selectivity estimates an order of magnitude more accurate than the independence assumption.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a novel approach based on a particular type of Bayesian networks called Chow-Liu trees to approximate the distribution of attribute values inside each relation of a database. Our results on the TPC-DS benchmark show that our method is an order of magnitude more precise than other approaches whilst remaining reasonably efficient in terms of time and space.
What carries the argument
Chow-Liu trees, tree-structured Bayesian networks that encode the strongest pairwise dependencies between attributes and thereby approximate their joint distribution for selectivity calculation.
If this is right
- Query optimizers receive row-count estimates whose errors are an order of magnitude smaller, so they more often select efficient join orders and access methods.
- The space and time overhead of maintaining one Chow-Liu tree per relation stays low enough for practical deployment inside existing cost models.
- The independence assumption can be dropped for any relation whose attributes exhibit tree-like pairwise correlations without redesigning the rest of the optimizer.
- TPC-DS workloads, which contain realistic correlation patterns, serve as a concrete demonstration that the precision gain is achievable on standard decision-support data.
Where Pith is reading between the lines
- The same trees could be refreshed incrementally when data changes, turning the method into a lightweight online statistics collector.
- Because the trees are small, they could be shipped with data exports or used inside federated query engines that lack global statistics.
- If higher-order dependencies prove important on some workloads, the Chow-Liu construction could be replaced by a slightly denser graphical model while preserving the same interface to the cost model.
Load-bearing premise
That Chow-Liu trees built from sampled data will capture the correlations that actually matter for selectivity estimation across the full range of queries without introducing new sources of error that offset the gains.
What would settle it
Measure selectivity error on a new benchmark containing known higher-order attribute dependencies; if the error is no longer reduced by roughly a factor of ten relative to the independence baseline, the central claim fails.
Figures
read the original abstract
The efficiency of a query execution plan depends on the accuracy of the selectivity estimates given to the query optimiser by the cost model. The cost model makes simplifying assumptions in order to produce said estimates in a timely manner. These assumptions lead to selectivity estimation errors that have dramatic effects on the quality of the resulting query execution plans. A convenient assumption that is ubiquitous among current cost models is to assume that attributes are independent with each other. However, it ignores potential correlations which can have a huge negative impact on the accuracy of the cost model. In this paper we attempt to relax the attribute value independence assumption without unreasonably deteriorating the accuracy of the cost model. We propose a novel approach based on a particular type of Bayesian networks called Chow-Liu trees to approximate the distribution of attribute values inside each relation of a database. Our results on the TPC-DS benchmark show that our method is an order of magnitude more precise than other approaches whilst remaining reasonably efficient in terms of time and space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes using Chow-Liu trees (a restricted form of Bayesian network) to approximate the joint distribution of attribute values within each database relation. This relaxes the ubiquitous attribute-independence assumption in query cost models while aiming to keep estimation time and space costs modest. The central empirical claim is that the resulting selectivity estimates are an order of magnitude more precise than competing methods on the TPC-DS benchmark.
Significance. If the reported precision gains prove robust under controlled measurement and sensitivity checks, the work would supply a concrete, tree-structured mechanism for incorporating limited attribute correlations into selectivity estimation without incurring the cost of full joint distributions or high-dimensional histograms. Such a technique could be directly integrated into existing DBMS cost models.
major comments (2)
- [Abstract] Abstract: the statement that results show the method is 'an order of magnitude more precise than other approaches' supplies no information on the error metric employed (q-error, relative error, etc.), the concrete baseline implementations, the query workload size or mix, or any statistical measures such as standard deviation or confidence intervals. This directly undercuts the central empirical claim.
- [Evaluation] Evaluation (results section): no sensitivity analysis is presented on sample size used to build the Chow-Liu trees or on the fidelity of the learned tree structure to the predicate correlations actually appearing in TPC-DS queries. Because the approach relies on finite samples to recover both structure and parameters, any mismatch between the empirical distribution and the true distribution is load-bearing for the claimed accuracy improvement.
minor comments (1)
- [Abstract] Abstract: the phrase 'reasonably efficient in terms of time and space' is not accompanied by any concrete figures or comparison tables.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and outline revisions that will strengthen the manuscript's clarity and empirical support.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that results show the method is 'an order of magnitude more precise than other approaches' supplies no information on the error metric employed (q-error, relative error, etc.), the concrete baseline implementations, the query workload size or mix, or any statistical measures such as standard deviation or confidence intervals. This directly undercuts the central empirical claim.
Authors: We agree that the abstract would benefit from additional context to substantiate the central claim. In the revised version we will expand the abstract to specify that precision is quantified via q-error, that the baselines comprise attribute independence as well as the other methods evaluated in the paper, that the workload consists of the full set of TPC-DS query templates, and that results include mean q-error together with standard deviation across repeated trials. These additions will be kept concise while supplying the requested information. revision: yes
-
Referee: [Evaluation] Evaluation (results section): no sensitivity analysis is presented on sample size used to build the Chow-Liu trees or on the fidelity of the learned tree structure to the predicate correlations actually appearing in TPC-DS queries. Because the approach relies on finite samples to recover both structure and parameters, any mismatch between the empirical distribution and the true distribution is load-bearing for the claimed accuracy improvement.
Authors: The observation is correct: the current evaluation fixes a single sample size and does not report sensitivity or structure validation against TPC-DS predicates. We will insert a new subsection that varies sample size (0.1 %, 1 %, 5 %, 10 % of each relation) and plots the resulting q-error, and we will add a brief analysis of the learned tree edges versus the most frequent selection and join predicates appearing in the benchmark queries. This directly addresses the robustness concern while remaining within reasonable space limits. revision: yes
Circularity Check
Empirical evaluation on external benchmark; no derivation reduces to its inputs by construction
full rationale
The paper presents a method that builds Chow-Liu trees from sampled data to approximate joint distributions and then measures selectivity estimation error on the TPC-DS benchmark. The central claim (order-of-magnitude improvement) is an empirical comparison against ground-truth selectivities computed from the benchmark data, not a quantity obtained by fitting parameters inside the same experiment and then renaming the fit as a prediction. No equations are shown that define the reported precision in terms of the model's own parameters, no self-citation is invoked as a uniqueness theorem, and the approach does not rename a known result under new coordinates. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a novel approach based on a particular type of Bayesian networks called Chow-Liu trees to approximate the distribution of attribute values inside each relation of a database.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Chow-Liu trees minimise the KL divergence... retrieving a marginal distribution from a tree can be done in linear time
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Ac- cess path selection in a relational database management sys tem
P Griffiths Selinger, Morton M Astrahan, Donald D Chamber lin, Raymond A Lorie, and Thomas G Price. Ac- cess path selection in a relational database management sys tem. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data , pages 23–34. ACM, 1979
work page 1979
-
[2]
On the propagation of errors in the size of join results , vol- ume 20
Y annis E Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of join results , vol- ume 20. ACM, 1991
work page 1991
-
[3]
Joseph M Hellerstein. Looking back at postgres. arXiv preprint arXiv:1901.01973, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[4]
Presto: Interacting with petabytes of data at facebook
Martin Traverso. Presto: Interacting with petabytes of data at facebook. Retrieved February, 4:2014, 2013
work page 2014
-
[5]
Spark sql: Rel ational data processing in spark
Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Da vies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark sql: Rel ational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015
work page 2015
-
[6]
Improved histograms for selectivity estimation of range predicates
Viswanath Poosala, Peter J Haas, Y annis E Ioannidis, and Eugene J Shekita. Improved histograms for selectivity estimation of range predicates. In ACM Sigmod Record, volume 25, pages 294–305. ACM, 1996
work page 1996
-
[7]
Self-tunin g, gpu-accelerated kernel density models for mul- tidimensional selectivity estimation
Max Heimel, Martin Kiefer, and V olker Markl. Self-tunin g, gpu-accelerated kernel density models for mul- tidimensional selectivity estimation. In Proceedings of the 2015 ACM SIGMOD International Conferenc e on Management of Data, pages 1477–1492. ACM, 2015
work page 2015
-
[8]
Equi-depth multidim ensional histograms
M Muralikrishna and David J DeWitt. Equi-depth multidim ensional histograms. In ACM SIGMOD Record , volume 17, pages 28–36. ACM, 1988
work page 1988
-
[9]
Accura te estimation of the number of tuples satisfying a condi- tion
Gregory Piatetsky-Shapiro and Charles Connell. Accura te estimation of the number of tuples satisfying a condi- tion. ACM Sigmod Record, 14(2):256–276, 1984
work page 1984
-
[10]
Query size esti mation by adaptive sampling
Richard J Lipton and Jeffrey F Naughton. Query size esti mation by adaptive sampling. In Proceedings of the ninth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 40–46. ACM, 1990
work page 1990
-
[11]
Join size estimation subject to filter conditions
David V engerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment , 8(12):1530–1541, 2015
work page 2015
-
[12]
Cardinality estimation done right: Index-based join sampling
Viktor Leis, Bernharde Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017
work page 2017
-
[13]
Two-level sampling for join size estim ation
Y u Chen and Ke Yi. Two-level sampling for join size estim ation. In Proceedings of the 2017 ACM International Conference on Management of Data , pages 759–774. ACM, 2017
work page 2017
-
[14]
Learned Cardinalities: Estimating Correlated Joins with Deep Learning
Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis , Peter Boncz, and Alfons Kemper. Learned cardinali- ties: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
How good are query optimizers, really? Proceedings of the VLDB Endowment , 9(3):204–215, 2015
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Bo ncz, Alfons Kemper, and Thomas Neumann. How good are query optimizers, really? Proceedings of the VLDB Endowment , 9(3):204–215, 2015
work page 2015
-
[16]
Selec tivity estimation using probabilistic models
Lise Getoor, Benjamin Taskar, and Daphne Koller. Selec tivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, pages 461–472. ACM, 2001
work page 2001
-
[17]
Lightweight graphical models for selectivity estima- tion without independence assumptions
Kostas Tzoumas, Amol Deshpande, and Christian S Jensen . Lightweight graphical models for selectivity estima- tion without independence assumptions. Proceedings of the VLDB Endowment , 4(11):852–863, 2011. 10 A PREPRINT - J ULY 16, 2019
work page 2011
-
[18]
An introduction to Bayesian networks , volume 210
Finn V Jensen. An introduction to Bayesian networks , volume 210. UCL press London, 1996
work page 1996
-
[19]
Neil Robertson and Paul D. Seymour. Graph minors. ii. al gorithmic aspects of tree-width. Journal of algorithms, 7(3):309–322, 1986
work page 1986
-
[20]
The optimization of queries in rela tional databases
Robert Philip Kooi. The optimization of queries in rela tional databases. 1980
work page 1980
-
[21]
Optim al histograms for limiting worst-case error propagation in the size of join results
Y annis E Ioannidis and Stavros Christodoulakis. Optim al histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems (TODS) , 18(4):709–748, 1993
work page 1993
-
[22]
St holes: a multidimensional workload-aware histogram
Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. St holes: a multidimensional workload-aware histogram. In ACM SIGMOD Record, volume 30, pages 211–222. ACM, 2001
work page 2001
-
[23]
On rectangular partitionings in two dimensions: Algo- rithms, complexity and applications
S Muthukrishnan, Viswanath Poosala, and Torsten Suel. On rectangular partitionings in two dimensions: Algo- rithms, complexity and applications. Database Theory—ICDT’99 , pages 236–256, 1999
work page 1999
-
[24]
The computational complexity of prob abilistic inference using bayesian belief networks
Gregory F Cooper. The computational complexity of prob abilistic inference using bayesian belief networks. Artificial intelligence, 42(2-3):393–405, 1990
work page 1990
-
[25]
Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayy a. On random sampling over joins. In ACM SIGMOD Record, volume 28, pages 263–274. ACM, 1999
work page 1999
-
[26]
Random sampling from databases
Frank Olken. Random sampling from databases . PhD thesis, University of California, Berkeley, 1993
work page 1993
-
[27]
Wander join: O nline aggregation via random walks
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander join: O nline aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data , pages 615–629. ACM, 2016
work page 2016
-
[28]
Join synopses for approxi- mate query answering
Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approxi- mate query answering. In ACM SIGMOD Record, volume 28, pages 275–286. ACM, 1999
work page 1999
-
[29]
Michael Stillger, Guy M Lohman, V olker Markl, and Mokht ar Kandil. Leo-db2’s learning optimizer. In VLDB, volume 1, pages 19–28, 2001
work page 2001
-
[30]
Adaptive selectivity estimation using query feedback , vol- ume 23
Chungmin Melvin Chen and Nick Roussopoulos. Adaptive selectivity estimation using query feedback , vol- ume 23. ACM, 1994
work page 1994
-
[31]
Le arning bayesian networks: The combination of knowledge and statistical data
David Heckerman, Dan Geiger, and David M Chickering. Le arning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995
work page 1995
-
[32]
Learning bayesian network structure using lp relaxations
Tommi Jaakkola, David Sontag, Amir Globerson, and Mari na Meila. Learning bayesian network structure using lp relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 358–365, 2010
work page 2010
-
[33]
Approximating discrete probabilit y distributions with dependence trees
C Chow and Cong Liu. Approximating discrete probabilit y distributions with dependence trees. IEEE transac- tions on Information Theory , 14(3):462–467, 1968
work page 1968
-
[34]
Probabilistic networks and expert systems: Exact computational methods for Bayesian n etworks
Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian n etworks. Springer Science & Business Media, 2006
work page 2006
-
[35]
The Steiner tree problem, volume 53
Frank K Hwang, Dana S Richards, and Pawel Winter. The Steiner tree problem, volume 53. Elsevier, 1992
work page 1992
-
[36]
Why you should run tpc-ds: a workload analysis
Meikel Poess, Raghunath Othayoth Nambiar, and David Wa lrath. Why you should run tpc-ds: a workload analysis. In Proceedings of the 33rd international conference on V ery large data bases, pages 1138–1149. VLDB Endowment, 2007
work page 2007
-
[37]
Preventing bad plans by bounding the impact of cardinality estimation errors
Guido Moerkotte, Thomas Neumann, and Gabriele Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment , 2(1):982–993, 2009
work page 2009
-
[38]
Query optimization through the looking glass, and what we found running the join order benchmark
Viktor Leis, Bernhard Radke, Andrey Gubichev, Atanas M irchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal , pages 1–26, 2018. 11
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.