pith. sign in

arxiv: 2409.07014 · v3 · pith:M6Y7KGVQnew · submitted 2024-09-11 · 📊 stat.ML · cs.DB· cs.LG

A Practical Theory of Generalization in Selectivity Learning

Pith reviewed 2026-05-23 21:22 UTC · model grok-4.3

classification 📊 stat.ML cs.DBcs.LG
keywords selectivity estimationquery-driven learningout-of-distribution generalizationsigned measuresPAC learningdatabase query optimizationmachine learning for databases
0
0 comments X

The pith

Selectivity predictors induced by signed measures are learnable and exhibit favorable out-of-distribution generalization bounds beyond the PAC framework under mild assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that query-driven selectivity models based on signed measures can be learned without requiring probability measures, relaxing a key restriction in existing theory. It then derives out-of-distribution generalization error bounds that go beyond what the standard PAC learning framework provides, again under mild assumptions. These results directly motivate two practical strategies for improving how existing selectivity models handle unseen query workloads. The work verifies the approach both theoretically and through experiments measuring prediction accuracy and end-to-end query latency.

Core claim

Selectivity predictors induced by signed measures are learnable and, under mild assumptions, exhibit favorable out-of-distribution generalization error bounds beyond the standard PAC framework; these advances enable two general strategies that improve OOD generalization for query-driven selectivity models while preserving strong in-distribution performance.

What carries the argument

Selectivity predictors induced by signed measures, which relax the probability-measure restriction of prior theory and support OOD error bounds.

If this is right

  • Query-driven selectivity models can be designed with explicit OOD guarantees rather than relying solely on in-distribution PAC bounds.
  • Two general improvement strategies become available that raise accuracy on unseen queries without harming performance on training distributions.
  • Database systems can expect better query planning latency when selectivity estimates must transfer to new workloads.
  • The gap between practical selectivity estimators and theoretical learning guarantees narrows for the signed-measure class.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar signed-measure constructions could be tested for cardinality estimation tasks that also require cross-distribution robustness.
  • The OOD bounds may suggest regularization choices during training that explicitly target distribution shift in other database ML settings.
  • If the mild assumptions prove easy to satisfy in practice, the approach could extend to learned cost models or join-order predictors facing workload drift.

Load-bearing premise

The mild assumptions needed for the out-of-distribution generalization error bounds to hold.

What would settle it

An empirical test in which the proposed OOD strategies produce no measurable improvement in prediction error or latency on workloads drawn from a different distribution than the training data, despite the stated assumptions being satisfied.

Figures

Figures reproduced from arXiv: 2409.07014 by Haoshu Xu, Peizhi Wu, Ryan Marcus, Zachary G. Ives.

Figure 1
Figure 1. Figure 1: Left: three data points (𝐴, 𝐵,𝐶) and four ranges. Right Top: predictions from selectivity functions (𝑆1 ∼ 𝑆4) for the four ranges. Right Bottom: corresponding measures (𝜇1 ∼ 𝜇4) that induce 𝑆1 ∼ 𝑆4, including their measure values on 𝐴, 𝐵,𝐶 (technically, the three sets {𝐴}, {𝐵}, {𝐶}). subsets of X closed under complements and countable unions and intersections. For all practical applications, it holds that … view at source ↗
Figure 2
Figure 2. Figure 2: Left: relationship between a rectangle query and [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training a selectivity model M with SeConCDF. modeling to achieve both strong in-distribution generalization with arbitrary loss functions and improved OOD generalization? SeConCDF addresses this limitation by adopting direct query selectivity modeling (which avoids the negative estimate issue) and introducing a soft constraint on the signed measure, unlike the hard constraint used in NeuroCDF. Specificall… view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy on Census with granularity shifts. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-query latency performance on CEB-1a-varied under granularity shift. Left: In-Dist queries; Right: OOD queries. 2479 3 1345 668 1665 1180 1104 2069 316 951 2713 478 1215 2645 787 2840 2629 752 2695 1633 Query ID (sorted by Postgres running time) 0 20 40 60 80 100 Running Time (s) PostgreSQL (workload time: 606s) MSCN (workload time: 242s) Robust-MSCN* (workload time: 235s) MSCN+CDF (workload time: 242s)… view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy of OOD point queries on IMDb-small. (a) center move (b) granularity shift (c) center move (d) granularity shift [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: OOD query latency performance on IMDb-small [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
read the original abstract

Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that selectivity predictors induced by signed measures are learnable (relaxing the probability-measure restriction in prior PAC-based theory), and that under mild assumptions these predictors admit favorable OOD generalization error bounds that go beyond the standard PAC i.i.d. setting. The theory is used to derive two general strategies for improving OOD performance of existing query-driven selectivity models; these strategies are then shown empirically to improve both prediction accuracy and end-to-end query latency on OOD workloads while preserving in-distribution performance.

Significance. If the OOD bounds are valid under assumptions that are both mild and satisfied by realistic query shifts, the work supplies the first explicit theoretical account of out-of-distribution behavior for this class of estimators and directly informs practical model design. The combination of a signed-measure relaxation, explicit (if still unspecified) OOD bounds, and reproducible empirical validation on latency metrics constitutes a concrete advance over purely PAC-style analyses.

major comments (2)
  1. [Abstract / OOD generalization section] Abstract (OOD paragraph) and the section deriving the OOD bounds: the claim that the predictors 'exhibit favorable out-of-distribution (OOD) generalization error bounds' under 'mild assumptions' is load-bearing for the central contribution, yet the assumptions themselves (e.g., conditions on the Radon-Nikodym derivative between training and test measures, Lipschitz continuity of the shift, or support overlap) are never stated explicitly. Without an enumerated list it is impossible to verify whether the assumptions are weaker than PAC i.i.d., whether they hold for the workload shifts used in the experiments, or whether the derived bounds are non-vacuous.
  2. [Empirical evaluation] The empirical section reporting OOD results: the paper asserts that the proposed strategies 'help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance,' but no quantitative comparison is given against the theoretical OOD bound (e.g., whether observed error lies inside the derived bound or how the bound scales with the observed distribution shift). This leaves the link between the theoretical guarantee and the reported numbers unverified.
minor comments (2)
  1. [Theoretical development] Notation for signed measures and the induced predictors should be introduced once with a clear definition before being used in the learnability statement.
  2. [Practical strategies] The two general strategies for improving OOD generalization are described at a high level; a short algorithmic pseudocode or explicit optimization objective for each would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where the presentation of the OOD theory can be strengthened for clarity and verifiability. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / OOD generalization section] Abstract (OOD paragraph) and the section deriving the OOD bounds: the claim that the predictors 'exhibit favorable out-of-distribution (OOD) generalization error bounds' under 'mild assumptions' is load-bearing for the central contribution, yet the assumptions themselves (e.g., conditions on the Radon-Nikodym derivative between training and test measures, Lipschitz continuity of the shift, or support overlap) are never stated explicitly. Without an enumerated list it is impossible to verify whether the assumptions are weaker than PAC i.i.d., whether they hold for the workload shifts used in the experiments, or whether the derived bounds are non-vacuous.

    Authors: We agree that an explicit, enumerated list of assumptions is needed for verifiability. The OOD bounds rely on three main conditions: (1) the Radon-Nikodym derivative dP_test/dP_train is bounded by a finite constant M (allowing controlled shift), (2) the selectivity function is Lipschitz continuous w.r.t. a metric on the query feature space, and (3) the training and test supports overlap on a set of positive measure. These are strictly weaker than the i.i.d. assumption of standard PAC learning. In the revision we will insert a dedicated paragraph (or subsection) that enumerates these assumptions, proves they are milder than PAC-i.i.d., verifies they hold for the query shifts used in the experiments (via explicit computation of the Radon-Nikodym norm on the workloads), and shows that the resulting bounds remain non-vacuous for moderate shifts (M ≤ 5). revision: yes

  2. Referee: [Empirical evaluation] The empirical section reporting OOD results: the paper asserts that the proposed strategies 'help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance,' but no quantitative comparison is given against the theoretical OOD bound (e.g., whether observed error lies inside the derived bound or how the bound scales with the observed distribution shift). This leaves the link between the theoretical guarantee and the reported numbers unverified.

    Authors: We accept that a direct numerical link between the derived OOD bound and the reported errors is currently missing. The manuscript demonstrates that the two practical strategies (derived from the theory) improve OOD accuracy and latency, but does not tabulate observed error versus the bound value. In the revised version we will add a short subsection (or appendix table) that, for each OOD workload, reports (a) the measured distribution shift (e.g., estimated Radon-Nikodym norm or total variation), (b) the empirical OOD error, and (c) the numerical value of the theoretical bound under the observed shift. We will also note any looseness in the bound. This will make the connection between theory and experiment explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent theoretical advance

full rationale

The abstract states that selectivity predictors induced by signed measures are learnable and yield OOD generalization bounds under mild assumptions beyond PAC, but provides no equations, fitted parameters, or self-citations that reduce these claims to their inputs by construction. The derivation chain is therefore self-contained against external benchmarks, with the unspecified assumptions representing a presentation gap rather than a definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available. The central claims rest on unspecified 'mild assumptions' for the OOD bounds; no free parameters, invented entities, or additional axioms are mentioned.

axioms (1)
  • domain assumption Mild assumptions under which OOD generalization error bounds hold
    Abstract states the bounds hold 'under mild assumptions' but does not enumerate them.

pith-pipeline@v0.9.0 · 5758 in / 1186 out tokens · 23562 ms · 2026-05-23T21:22:14.530887+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages

  1. [1]

    https://github.com/learnedsystems/CEB/tree/main

    Ceb. https://github.com/learnedsystems/CEB/tree/main

  2. [2]

    https://github.com/waltercai/pqo-opensource

    Modified postgresql. https://github.com/waltercai/pqo-opensource

  3. [3]

    https://github.com/andreaskipf/learnedcardinalities

    MSCN code. https://github.com/andreaskipf/learnedcardinalities. Last Accessed Date: 2025-04-05

  4. [4]

    https://github.com/huxiao2010/Selectivity/

    PtsHist and QuadHist code. https://github.com/huxiao2010/Selectivity/. Last Accessed Date: 2025-04-05

  5. [5]

    Foundations of databases, vol

    Abiteboul, S., Hull, R., and Vianu, V. Foundations of databases, vol. 8. Addison- Wesley Reading, 1995

  6. [6]

    In Proceedings of the 1999 ACM SIGMOD international conference on Management of data (1999), ACM, pp

    Aboulnaga, A., and Chaudhuri, S.Self-tuning histograms: building histograms without looking at data. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data (1999), ACM, pp. 181–192

  7. [7]

    Scale-sensitive dimensions, uniform convergence, and learnability

    Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM 44 , 4 (1997), 615–631

  8. [8]

    Learning Set Cardinality in Distance Nearest Neighbours

    Anagnostopoulos, C., and Triantafillou, P. Learning Set Cardinality in Distance Nearest Neighbours. In 2015 IEEE International Conference on Data Mining (2015), IEEE, pp. 691–696

  9. [9]

    Learning to accurately COUNT with query-driven predictive analytics

    Anagnostopoulos, C., and Triantafillou, P. Learning to accurately COUNT with query-driven predictive analytics. In 2015 IEEE International Conference on Big Data (Big Data) (2015), IEEE, pp. 14–23

  10. [10]

    Query-Driven Learning for Pre- dictive Analytics of Data Subspace Cardinality

    Anagnostopoulos, C., and Triantafillou, P. Query-Driven Learning for Pre- dictive Analytics of Data Subspace Cardinality. ACM Transactions on Knowledge Discovery from Data 11 , 4 (2017), 1–46

  11. [11]

    L., and Long, P

    Bartlett, P. L., and Long, P. M.More theorems about scale-sensitive dimensions and learning. In Proceedings of the eighth annual conference on Computational learning theory (1995), ACM Press, pp. 392–401

  12. [12]

    Bishop, C. M. Pattern recognition and machine learning . Information science and statistics. Springer, 2006

  13. [13]

    Exploiting statistics on query expressions for optimization

    Bruno, N., and Chaudhuri, S. Exploiting statistics on query expressions for optimization. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data (2002), ACM, pp. 263–274

  14. [14]

    STHoles: a multidimensional workload-aware histogram

    Bruno, N., Chaudhuri, S., and Gravano, L. STHoles: a multidimensional workload-aware histogram. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data (2001), ACM, pp. 211–222

  15. [15]

    M., and Roussopoulos, N

    Chen, C. M., and Roussopoulos, N. Adaptive selectivity estimation using query feedback. In Proceedings of the 1994 ACM SIGMOD international conference on Management of data (1994), ACM, pp. 161–172

  16. [16]

    XGBoost: A Scalable Tree Boosting System

    Chen, T., and Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), ACM, pp. 785–794

  17. [17]

    B., Suciu, D., and Balazinska, M

    Deeds, K. B., Suciu, D., and Balazinska, M. SafeBound: A Practical System for Generating Cardinality Bounds. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–26

  18. [18]

    Proceedings of the VLDB Endowment 14 , 13 (2021), 3376–3388

    Ding, B., Chaudhuri, S., Gehrke, J., and Narasayya, V.DSB: a decision support benchmark for workload-driven and traditional database systems. Proceedings of the VLDB Endowment 14 , 13 (2021), 3376–3388

  19. [19]

    Probability: theory and examples, fifth edition ed

    Durrett, R. Probability: theory and examples, fifth edition ed. No. 49 in Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge ; New York, NY, 2019

  20. [20]

    Proceedings of the VLDB Endowment 13 , 12 (2020), 2215–2228

    Dutt, A., W ang, C., Narasayya, V., and Chaudhuri, S.Efficiently approximat- ing selectivity functions using low overhead regression models. Proceedings of the VLDB Endowment 13 , 12 (2020), 2215–2228

  21. [21]

    Selectivity estimation for range predicates using lightweight models

    Dutt, A., W ang, C., Nazi, A., Kandula, S., Narasayya, V., and Chaudhuri, S. Selectivity estimation for range predicates using lightweight models. Proceedings of the VLDB Endowment 12 , 9 (2019), 1044–1057

  22. [22]

    Deep Residual Learning for Image Recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), IEEE, pp. 770–778

  23. [23]

    Convolution and Cross- Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries

    Heddes, M., Nunes, I., Givargis, T., and Nicolau, A. Convolution and Cross- Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries. Proceedings of the ACM on Management of Data 2 , 3 (2024), 1–26

  24. [24]

    Probability of error, equivocation, and the Chernoff bound

    Hellman, M., and Raviv, J. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory 16 , 4 (1970), 368–372

  25. [25]

    DeepDB: learn from data, not from queries!Proceedings of the VLDB Endowment 13, 7 (2020), 992–1005

    Hilprecht, B., Schmidt, A., Kulessa, M., Molina, A., Kersting, K., and Binnig, C. DeepDB: learn from data, not from queries!Proceedings of the VLDB Endowment 13, 7 (2020), 992–1005

  26. [26]

    K., Panigrahi, D., Roy, S., and Yang, J

    Hu, X., Liu, Y., Xiu, H., Agarwal, P. K., Panigrahi, D., Roy, S., and Yang, J. Selectivity Functions of Range Queries are Learnable. In Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data (2022), ACM, pp. 959–972

  27. [27]

    E., and Christodoulakis, S

    Ioannidis, Y. E., and Christodoulakis, S. On the propagation of errors in the size of join results. In Proceedings of the 1991 ACM SIGMOD international conference on Management of data (1991), ACM Press, pp. 268–277

  28. [28]

    J., and Schapire, R

    Kearns, M. J., and Schapire, R. E. Efficient distribution-free learning of proba- bilistic concepts. Journal of Computer and System Sciences 48 , 3 (1994), 464–497

  29. [29]

    J., V azirani, U

    Kearns, M. J., V azirani, U. V., and V azirani, U. V.An introduction to computa- tional learning theory, 2. print ed. MIT Press, Cambridge, Mass., 1997

  30. [30]

    Learned Cardi- nality Estimation: An In-depth Study

    Kim, K., Jung, J., Seo, I., Han, W.-S., Choi, K., and Chong, J. Learned Cardi- nality Estimation: An In-depth Study. In Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data (2022), ACM, pp. 1214–1227

  31. [31]

    Proceedings of the ACM on Management of Data 2 , 1 (2024), 1–27

    Kim, K., Lee, S., Kim, I., and Han, W.-S.ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation. Proceedings of the ACM on Management of Data 2 , 1 (2024), 1–27

  32. [32]

    A., and Kemper, A

    Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P. A., and Kemper, A. Learned cardinalities: Estimating correlated joins with deep learning. In 9th Biennial Conference on Innovative Data Systems Research, CIDR 2019 (2019)

  33. [33]

    How good are query optimizers, really? Proceedings of the VLDB Endowment 9 , 3 (2015), 204–215

    Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., and Neumann, T. How good are query optimizers, really? Proceedings of the VLDB Endowment 9 , 3 (2015), 204–215

  34. [34]

    ALECE: An Attention- based Learned Cardinality Estimator for SPJ Queries on Dynamic Workloads

    Li, P., Wei, W., Zhu, R., Ding, B., Zhou, J., and Lu, H. ALECE: An Attention- based Learned Cardinality Estimator for SPJ Queries on Dynamic Workloads. Proceedings of the VLDB Endowment 17 , 2 (2023), 197–210

  35. [35]

    Lim, L., Wang, M., and Vitter, J. S. SASH: A self-adaptive histogram set for dynamically changing workloads. In Proceedings of 29th International Conference on Very Large Data Bases, VLDB 2003, Berlin, Germany, September 9-12, 2003 (2003), pp. 369–380

  36. [36]

    Certified monotonic neural networks

    Liu, X., Han, X., Zhang, N., and Liu, Q. Certified monotonic neural networks. In Advances in Neural Information Processing Systems 33 (2020)

  37. [37]

    Lynch, C. A. Selectivity estimation and query optimization in large databases with highly skewed distribution of column values. In Proceedings of the VLDB Endowment (1988), pp. 240–251

  38. [38]

    J., Kutsch, M., Megiddo, N., Srivastava, U., and Tran, T

    Markl, V., Haas, P. J., Kutsch, M., Megiddo, N., Srivastava, U., and Tran, T. M. Consistent selectivity estimation via maximum entropy. The VLDB Journal 16, 1 (2006), 55–76

  39. [39]

    M., and Raman, V

    Markl, V., Lohman, G. M., and Raman, V. LEO: An autonomic query optimizer for DB2. IBM Systems Journal 42 , 1 (2003), 98–106

  40. [40]

    Selectivity Estimation for Queries Containing Predicates over Set-Valued Attributes

    Meng, Z., Cao, X., and Cong, G. Selectivity Estimation for Queries Containing Predicates over Set-Valued Attributes. Proceedings of the ACM on Management of Data 1, 4 (2023), 1–26

  41. [41]

    Proceedings of the VLDB Endowment 2, 1 (2009), 982–993

    Moerkotte, G., Neumann, T., and Steidl, G.Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment 2, 1 (2009), 982–993

  42. [42]

    Flow-loss: learning cardinality estimates that matter

    Negi, P., Marcus, R., Kipf, A., Mao, H., Tatbul, N., Kraska, T., and Alizadeh, M. Flow-loss: learning cardinality estimates that matter. Proceedings of the VLDB Endowment 14, 11 (2021), 2019–2032

  43. [43]

    Robust Query Driven Cardinality Estimation under Changing Workloads

    Negi, P., Wu, Z., Kipf, A., Tatbul, N., Marcus, R., Madden, S., Kraska, T., and Alizadeh, M. Robust Query Driven Cardinality Estimation under Changing Workloads. Proceedings of the VLDB Endowment 16 , 6 (2023), 1520–1533

  44. [44]

    QuickSel: Quick Selectivity Learning with Mixture Models

    Park, Y., Zhong, S., and Mozafari, B. QuickSel: Quick Selectivity Learning with Mixture Models. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (2020), ACM, pp. 1017–1033

  45. [45]

    TPC-DS, taking decision support benchmarking to the next level

    Poess, M., Smith, B., Kollar, L., and Larson, P. TPC-DS, taking decision support benchmarking to the next level. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data (2002), ACM, pp. 582–587

  46. [46]

    Sample-Efficient Cardinality Estimation Using Geometric Deep Learning

    Reiner, S., and Grossniklaus, M. Sample-Efficient Cardinality Estimation Using Geometric Deep Learning. Proceedings of the VLDB Endowment 17, 4 (2023), 740–752

  47. [47]

    S., and Tesman, B

    Roberts, F. S., and Tesman, B. Applied combinatorics, second edition ed. CRC Press, 2009

  48. [48]

    Learning from label proportions: A mutual contami- nation framework

    Scott, C., and Zhang, J. Learning from label proportions: A mutual contami- nation framework. Advances in neural information processing systems 33 (2020), 22256–22267

  49. [49]

    G., Astrahan, M

    Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of A Practical Theory of Generalization in Selectivity Learning Conference’17, July 2017, Washington, DC, USA data (1979), ACM Pre...

  50. [50]

    Stein, E. M. Real Analysis: Measure Theory, Integration, and Hilbert Spaces , 1st ed ed. Princeton University Press, 2009

  51. [51]

    M., Markl, V., and Kandil, M

    Stillger, M., Lohman, G. M., Markl, V., and Kandil, M. LEO - db2’s learning optimizer. In Proceedings of the VLDB Endowment (2001), pp. 19–28

  52. [52]

    An end-to-end learning-based cost estimator

    Sun, J., and Li, G. An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment 13 , 3 (2019), 307–319

  53. [53]

    L., Li, S., Mao, Z., and Tang, B

    Wang, F., Yan, X., Yiu, M. L., Li, S., Mao, Z., and Tang, B. Speeding Up End- to-end Query Execution via Learning-based Progressive Cardinality Estimation. Proceedings of the ACM on Management of Data 1 , 1 (2023), 1–25

  54. [54]

    Decomposition—a strategy for query processing

    Wong, E., and Youssefi, K. Decomposition—a strategy for query processing. ACM Transactions on Database Systems 1 , 3 (1976), 223–241

  55. [55]

    Proceedings of the VLDB Endowment 12 , 3 (2018), 210–222

    Wu, C., Jindal, A., Amizadeh, S., Patel, H., Le, W., Qiao, S., and Rao, S.Towards a learning optimizer for shared clouds. Proceedings of the VLDB Endowment 12 , 3 (2018), 210–222

  56. [56]

    A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation

    Wu, P., and Cong, G. A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. In Proceedings of the 2021 International Conference on Management of Data (2021), ACM, pp. 2009–2022

  57. [57]

    Wu, P., and Ives, Z. G. Modeling Shifting Workloads for Learned Database Systems. Proceedings of the ACM on Management of Data 2 , 1 (2024), 1–27

  58. [58]

    FactorJoin: A New Cardinality Estimation Framework for Join Queries

    Wu, Z., Negi, P., Alizadeh, M., Kraska, T., and Madden, S. FactorJoin: A New Cardinality Estimation Framework for Join Queries. Proceedings of the ACM on Management of Data 1 , 1 (2023), 1–27

  59. [59]

    K., and Yang, J

    Xiu, H., Agarwal, P. K., and Yang, J. PARQO: Penalty-Aware Robust Plan Selection in Query Optimization. Proceedings of the VLDB Endowment 17 , 13 (2024), 4627–4640

  60. [60]

    NeuroCard: one cardinality estimator for all tables

    Yang, Z., Kamsetty, A., Luan, S., Liang, E., Duan, Y., Chen, X., and Stoica, I. NeuroCard: one cardinality estimator for all tables. Proceedings of the VLDB Endowment 14, 1 (2020), 61–73

  61. [61]

    Understanding deep learning (still) requires rethinking generalization

    Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64, 3 (2021), 107–115

  62. [62]

    Advances in Neural Information Processing Systems 35 (2022), 26933–26942

    Zhang, J., W ang, Y., and Scott, C.Learning from label proportions by learning with label noise. Advances in Neural Information Processing Systems 35 (2022), 26933–26942

  63. [63]

    Duet: Ef- ficient and Scalable Hybrid Neural Relation Understanding

    Zhang, K., Wang, H., Lu, Y., Li, Z., Shu, C., Yan, Y., and Yang, D. Duet: Ef- ficient and Scalable Hybrid Neural Relation Understanding. In 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2024), IEEE, pp. 56–69