Recognition: unknown
Sidorenko-Inspired Pessimistic Estimation
Pith reviewed 2026-05-10 10:52 UTC · model grok-4.3
The pith
Caterpillar homomorphism counts tighten upper bounds on database join sizes to an m to the 3/5 overestimation factor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By counting homomorphisms from caterpillar graphs rather than stars or bi-stars, the pessimistic estimation method obtains join-size upper bounds whose growth rate with input size m improves to an empirical exponent of 3/5, as measured by linear regression on log-log plots of overestimation factors across simulated instances.
What carries the argument
Caterpillar graphs (trees in which every vertex lies at distance at most one from a central path), whose homomorphism counts into the input relations supply the local entropy upper bounds that are combined via Shannon inequalities to bound the joint entropy of a uniform random join row.
If this is right
- The caterpillar bounds are computable in time linear in the total size of the input tables.
- The same linear-time homomorphism counting works for any fixed caterpillar and any number of input relations.
- The framework unifies star, bi-star, and caterpillar bounds as special cases of a single graph-homomorphism approach.
- The observed improvement in exponent directly reduces the overestimation factor for large m in query optimizers that adopt the bound.
Where Pith is reading between the lines
- If the 3/5 exponent holds in the worst case, then further Sidorenko-satisfying graphs could produce still tighter exponents.
- The method may extend to bounding the number of answers to more general conjunctive queries beyond simple joins.
- Connections to extremal graph theory suggest that resolving parts of Sidorenko's conjecture could yield provably optimal entropy bounds for this database setting.
Load-bearing premise
The log-log regression on simulated random instances accurately captures the true worst-case asymptotic overestimation exponent of the caterpillar bound.
What would settle it
A family of join instances with growing size m for which the ratio of the caterpillar bound to the actual join size exhibits a log-log slope strictly larger than 3/5.
Figures
read the original abstract
Recently, Abo Khamis et al. showed how to upper bound the size of a join of multiple tables, a problem essential to query optimization in database theory. They unified earlier works by the following information-theoretical framework. 1. Let $(X_1,..., X_n)$ be a row selected from the join uniformly at random. 2. The size of the join is now $\exp(H(X_1,..., X_n))$. 3. To upper bound $H(X_1,..., X_n)$, break it into several $\textit{local entropies}$, such as $H(X_1)$, $H(X_2, X_3)$, and $H(X_4|X_5)$, using Shannon-type inequalities. 4. Upper bound local entropies using statistics of the tables being joined. The statistics Abo Khamis et al. considered are the counts of graph homomorphisms from stars to the tables. In a follow-up work, we generalized stars to bi-stars. In this paper, we generalize bi-stars to caterpillars, an even larger class of graphs inspired by Sidorenko's conjecture. Simulations show that, while Abo Khamis et al.'s star bound overestimates the join size by $m$, our bi-star bound overestimates by about $m^{3/4}$, and this paper's new caterpillar bound overestimates by about $m^{3/5}$. These exponents are obtained by log-log regressions with R-square $> 0.98$. All homomorphisms are counted in time linear in the size of the tables being joined.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the information-theoretic framework of Abo Khamis et al. for upper-bounding database join sizes. It generalizes the use of star homomorphism counts to bi-stars and then to caterpillars (inspired by Sidorenko's conjecture) to bound local entropies via Shannon inequalities, yielding an upper bound on the join size exp(H(X_1,...,X_n)). Simulations are used to claim that the caterpillar bounds overestimate join size by ~m^{3/5} (versus m for stars and m^{3/4} for bi-stars), with exponents fitted via log-log regression (R² > 0.98). All relevant homomorphism counts are computable in linear time.
Significance. If the empirical exponents are representative of worst-case asymptotics and the entropy-bounding step is made rigorous, the caterpillar generalization could yield practically tighter pessimistic estimates for query optimization. The linear-time counting property is a concrete algorithmic strength that would be immediately usable. The Sidorenko-inspired choice of graphs is a natural and potentially extensible direction, though the current reliance on simulation rather than derivation or proof limits the theoretical contribution.
major comments (2)
- [Abstract] Abstract and simulation results: the central claim that the caterpillar bound overestimates by m^{3/5} rests on log-log regression of simulated join sizes versus bounds. No argument is supplied that the generated instances realize the supremum scaling of (actual join size / caterpillar upper bound); if they do not, the fitted exponent is an artifact of the test distribution rather than a proven improvement over the bi-star case.
- [Framework description] Framework (steps 3-4): the manuscript asserts that caterpillar homomorphism counts can be substituted for the local entropies H(X_i,...) via Shannon inequalities. No derivation or set of inequalities is exhibited showing that this substitution is valid for arbitrary tables and that it does not introduce looseness beyond what is already present in the star/bi-star cases.
minor comments (1)
- The abstract states that 'all homomorphisms are counted in time linear in the size of the tables'; a brief complexity argument or reference to the counting algorithm should appear in the main text to substantiate this claim.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below, clarifying the scope of our contributions and indicating revisions to improve rigor and presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract and simulation results: the central claim that the caterpillar bound overestimates by m^{3/5} rests on log-log regression of simulated join sizes versus bounds. No argument is supplied that the generated instances realize the supremum scaling of (actual join size / caterpillar upper bound); if they do not, the fitted exponent is an artifact of the test distribution rather than a proven improvement over the bi-star case.
Authors: We agree that the reported exponents are empirical observations from our simulation suite rather than a proof that the instances attain the worst-case supremum scaling. The test distributions were designed to cover varied join cardinalities and selectivities, yielding consistent log-log fits with R² > 0.98, but we do not claim these realize the asymptotic supremum. We will revise the abstract and simulation section to state that the caterpillar bound overestimates join size by approximately m^{3/5} in the evaluated instances (versus m for stars and m^{3/4} for bi-stars), and we will add a paragraph discussing the limitations of simulation-based exponent estimation together with open questions for theoretical confirmation. revision: yes
-
Referee: [Framework description] Framework (steps 3-4): the manuscript asserts that caterpillar homomorphism counts can be substituted for the local entropies H(X_i,...) via Shannon inequalities. No derivation or set of inequalities is exhibited showing that this substitution is valid for arbitrary tables and that it does not introduce looseness beyond what is already present in the star/bi-star cases.
Authors: The substitution is justified by the same Shannon-inequality decomposition used for stars and bi-stars, with the caterpillar homomorphism count supplying the local entropy upper bound via the relation H(local variables) ≤ log(homomorphism count) plus conditional terms that follow the caterpillar's path structure. This does not introduce extra looseness because the caterpillar is a tree and inherits the tight bounding properties from the bi-star case. We acknowledge that the current manuscript states the substitution without exhibiting the explicit chain of inequalities. In revision we will add a short subsection (or appendix) that lists the precise Shannon inequalities for an arbitrary caterpillar and proves that the resulting bound is at least as tight as the bi-star bound for the same local variables. revision: yes
Circularity Check
No significant circularity detected
full rationale
The derivation begins with the external Abo Khamis et al. information-theoretic framework (local entropies bounded via Shannon inequalities) and extends the graph class from stars to caterpillars, drawing inspiration from the external Sidorenko conjecture rather than re-deriving prior results. Homomorphism counts are used directly to bound the local entropies, with linear-time computation stated as a separate algorithmic claim. The reported overestimation exponents (m, m^{3/4}, m^{3/5}) are empirical observations obtained by log-log regression on simulations; they are not presented as theoretical predictions derived from the bound itself. The self-citation to the authors' prior bi-star work is incidental and not load-bearing for the caterpillar generalization or the simulation results. No equation or claim reduces by construction to its own inputs, fitted parameters, or self-citation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Shannon-type inequalities suffice to upper-bound joint entropy from local entropies
- domain assumption Homomorphism counts from caterpillars to tables upper-bound the required local entropies
Reference graph
Works this paper leans on
-
[1]
How good are query optimizers, really?
V . Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neu- mann, “How good are query optimizers, really?”Proceedings of the VLDB Endowment, vol. 9, no. 3, pp. 204–215, Nov. 2015. 2https://snap.stanford.edu/data/
2015
-
[2]
Cardinality esti- mation in DBMS: A comprehensive benchmark evaluation,
Y . Han, Z. Wu, P. Wu, R. Zhu, J. Yang, L. W. Tan, K. Zeng, G. Cong, Y . Qin, A. Pfadler, Z. Qian, J. Zhou, J. Li, and B. Cui, “Cardinality esti- mation in DBMS: A comprehensive benchmark evaluation,”Proceedings of the VLDB Endowment, vol. 15, no. 4, pp. 752–765, Dec. 2021
2021
-
[3]
In-Memory Subgraph Matching: An In-depth Study,
S. Sun and Q. Luo, “In-Memory Subgraph Matching: An In-depth Study,” inProceedings of the 2020 ACM SIGMOD International Con- ference on Management of Data. Portland OR USA: ACM, Jun. 2020, pp. 1083–1098
2020
-
[4]
G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,
Y . Park, S. Ko, S. S. Bhowmick, K. Kim, K. Hong, and W.-S. Han, “G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,” inProceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Portland OR USA: ACM, Jun. 2020, pp. 1099–1114
2020
-
[5]
A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration,
H. Lan, Z. Bao, and Y . Peng, “A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration,” Data Science and Engineering, vol. 6, no. 1, pp. 86–101, Mar. 2021
2021
-
[6]
Data dependencies for query optimization: A survey,
J. Kossmann, T. Papenbrock, and F. Naumann, “Data dependencies for query optimization: A survey,”The VLDB Journal, vol. 31, no. 1, pp. 1–22, Jan. 2022
2022
-
[7]
Size Bounds and Query Plans for Relational Joins,
A. Atserias, M. Grohe, and D. Marx, “Size Bounds and Query Plans for Relational Joins,”SIAM Journal on Computing, vol. 42, no. 4, pp. 1737–1767, Jan. 2013
2013
-
[8]
Skew strikes back: New developments in the theory of join algorithms,
H. Q. Ngo, C. R ´e, and A. Rudra, “Skew strikes back: New developments in the theory of join algorithms,”ACM SIGMOD Record, vol. 42, no. 4, pp. 5–16, Feb. 2014
2014
-
[9]
Triejoin: A Simple, Worst-Case Optimal Join Algo- rithm,
T. Veldhuizen, “Triejoin: A Simple, Worst-Case Optimal Join Algo- rithm,” 2014
2014
-
[10]
Worst-case Optimal Join Algorithms,
H. Q. Ngo, E. Porat, C. R ´e, and A. Rudra, “Worst-case Optimal Join Algorithms,”Journal of the ACM, vol. 65, no. 3, pp. 1–40, Jun. 2018
2018
-
[11]
Computing Join Queries with Functional Dependencies,
M. Abo Khamis, H. Q. Ngo, and D. Suciu, “Computing Join Queries with Functional Dependencies,” inProceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Sys- tems. San Francisco California USA: ACM, Jun. 2016, pp. 327–342
2016
-
[12]
What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?
——, “What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?” inProceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. Chicago Illinois USA: ACM, May 2017, pp. 429– 444
2017
-
[13]
Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs,
J. Chen, Y . Huang, M. Wang, S. Salihoglu, and K. Salem, “Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs,”Proceedings of the VLDB Endowment, vol. 15, no. 8, pp. 1533–1545, Apr. 2022
2022
-
[14]
Pessimistic Cardinality Estima- tion: Tighter Upper Bounds for Intermediate Join Cardinalities,
W. Cai, M. Balazinska, and D. Suciu, “Pessimistic Cardinality Estima- tion: Tighter Upper Bounds for Intermediate Join Cardinalities,” 2019
2019
-
[15]
Simplicity Done Right for Join Ordering,
A. Hertzschuch, C. Hartmann, D. Habich, and W. Lehner, “Simplicity Done Right for Join Ordering,”CIDR, 2021
2021
-
[16]
Join Size Bounds using l p -Norms on Degree Sequences,
M. Abo Khamis, V . Nakos, D. Olteanu, and D. Suciu, “Join Size Bounds using l p -Norms on Degree Sequences,”Proceedings of the ACM on Management of Data, vol. 2, no. 2, pp. 1–24, May 2024
2024
-
[17]
Degree Sequence Bound For Join Cardinality Estimation,
K. Deeds, D. Suciu, M. Balazinska, and W. Cai, “Degree Sequence Bound For Join Cardinality Estimation,” Mar. 2022
2022
-
[18]
SafeBound: A Practical System for Generating Cardinality Bounds,
K. B. Deeds, D. Suciu, and M. Balazinska, “SafeBound: A Practical System for Generating Cardinality Bounds,”Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–26, May 2023
2023
-
[19]
Degree Sequence Bounds,
K. Deeds, D. Suciu, M. Balazinska, and W. Cai, “Degree Sequence Bounds,”ACM Transactions on Database Systems, p. 3716378, May 2025
2025
-
[20]
LPBOUND: Pessimistic Cardinality Estimation Usingℓ p -Norms of Degree Sequences,
H. Zhang, C. Mayer, M. Abo Khamis, D. Olteanu, and D. Suciu, “LPBOUND: Pessimistic Cardinality Estimation Usingℓ p -Norms of Degree Sequences,”Proceedings of the ACM on Management of Data, vol. 3, no. 3, pp. 1–27, Jun. 2025
2025
-
[21]
LP Bound in Action: Cardinality Estimation with One-Sided Guarantees,
C. Mayer, H. Zhang, M. Abo Khamis, D. Olteanu, and D. Suciu, “LP Bound in Action: Cardinality Estimation with One-Sided Guarantees,” inCompanion of the 2025 International Conference on Management of Data. Berlin Germany: ACM, Jun. 2025, pp. 187–190
2025
-
[22]
Ambidextrous degree sequence bounds for pessimistic cardinality estimation,
Y .-T. Lin and H.-P. Wang, “Ambidextrous degree sequence bounds for pessimistic cardinality estimation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.04249
-
[23]
An information theoretic approach to sidorenko’s conjecture,
B. Szegedy, “An information theoretic approach to sidorenko’s conjecture,” 2015. [Online]. Available: https://arxiv.org/abs/1406.6738 APPENDIXA REALCATERPILLARMOMENTS ANDCONVEXITY Theorem 6 in Section V is limted to integer parameters because caterpillars have whole numbers of leaves. Never- theless, the entropy bounds themselves extend to nonnegative rea...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.