arxiv: 2604.14647 · v1 · submitted 2026-04-16 · 💻 cs.IT · math.IT

Recognition: unknown

Sidorenko-Inspired Pessimistic Estimation

Yu-Ting Lin , Hsin-Po Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 10:52 UTC · model grok-4.3

classification 💻 cs.IT math.IT

keywords database joinsinformation theoryhomomorphism countingcaterpillar graphsSidorenko conjecturequery optimizationentropy boundssize estimation

0 comments

The pith

Caterpillar homomorphism counts tighten upper bounds on database join sizes to an m to the 3/5 overestimation factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends an information-theoretic framework for upper-bounding the cardinality of relational joins by replacing star and bi-star graphs with caterpillar graphs when counting homomorphisms into the input tables. This generalization draws from Sidorenko's conjecture to produce a larger family of local entropy bounds that still rely on Shannon inequalities. Simulations on random instances show the new bounds overestimate the true join size by roughly m to the 3/5 power, compared with m for stars and m to the 3/4 for bi-stars, with log-log regressions achieving R-square values above 0.98. The bounds remain computable in linear time in the size of the tables. A sympathetic reader cares because more accurate pessimistic size estimates directly improve the choice of execution plans in database query optimizers.

Core claim

By counting homomorphisms from caterpillar graphs rather than stars or bi-stars, the pessimistic estimation method obtains join-size upper bounds whose growth rate with input size m improves to an empirical exponent of 3/5, as measured by linear regression on log-log plots of overestimation factors across simulated instances.

What carries the argument

Caterpillar graphs (trees in which every vertex lies at distance at most one from a central path), whose homomorphism counts into the input relations supply the local entropy upper bounds that are combined via Shannon inequalities to bound the joint entropy of a uniform random join row.

If this is right

The caterpillar bounds are computable in time linear in the total size of the input tables.
The same linear-time homomorphism counting works for any fixed caterpillar and any number of input relations.
The framework unifies star, bi-star, and caterpillar bounds as special cases of a single graph-homomorphism approach.
The observed improvement in exponent directly reduces the overestimation factor for large m in query optimizers that adopt the bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the 3/5 exponent holds in the worst case, then further Sidorenko-satisfying graphs could produce still tighter exponents.
The method may extend to bounding the number of answers to more general conjunctive queries beyond simple joins.
Connections to extremal graph theory suggest that resolving parts of Sidorenko's conjecture could yield provably optimal entropy bounds for this database setting.

Load-bearing premise

The log-log regression on simulated random instances accurately captures the true worst-case asymptotic overestimation exponent of the caterpillar bound.

What would settle it

A family of join instances with growing size m for which the ratio of the caterpillar bound to the actual join size exhibits a log-log slope strictly larger than 3/5.

Figures

Figures reproduced from arXiv: 2604.14647 by Hsin-Po Wang, Yu-Ting Lin.

read the original abstract

Recently, Abo Khamis et al. showed how to upper bound the size of a join of multiple tables, a problem essential to query optimization in database theory. They unified earlier works by the following information-theoretical framework. 1. Let $(X_1,..., X_n)$ be a row selected from the join uniformly at random. 2. The size of the join is now $\exp(H(X_1,..., X_n))$. 3. To upper bound $H(X_1,..., X_n)$, break it into several $\textit{local entropies}$, such as $H(X_1)$, $H(X_2, X_3)$, and $H(X_4|X_5)$, using Shannon-type inequalities. 4. Upper bound local entropies using statistics of the tables being joined. The statistics Abo Khamis et al. considered are the counts of graph homomorphisms from stars to the tables. In a follow-up work, we generalized stars to bi-stars. In this paper, we generalize bi-stars to caterpillars, an even larger class of graphs inspired by Sidorenko's conjecture. Simulations show that, while Abo Khamis et al.'s star bound overestimates the join size by $m$, our bi-star bound overestimates by about $m^{3/4}$, and this paper's new caterpillar bound overestimates by about $m^{3/5}$. These exponents are obtained by log-log regressions with R-square $> 0.98$. All homomorphisms are counted in time linear in the size of the tables being joined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends the Abo Khamis information-theoretic join bounds from stars and bi-stars to caterpillars, reporting simulation exponents that improve from m to m^{3/4} to m^{3/5}, but the support is entirely empirical with no derivations shown.

read the letter

The central new piece is the move to caterpillar graphs for the local entropy upper bounds, chosen because they align with Sidorenko-type ideas. They keep the linear-time homomorphism counting property, which matters for any real implementation. The simulations then show the overestimation factor dropping to roughly m to the 3/5, with clean log-log regressions and R-squared above 0.98 across their test instances. That is a concrete, measurable step beyond the bi-star work if the numbers hold up under broader scrutiny. The linear-time claim is also useful on its own for anyone who wants to plug these bounds into a query optimizer without blowing up the runtime. The main limitation is that the exponent and the validity of the entropy substitution both rest on the simulation distribution and the regression fit. There is no derivation or proof that the caterpillar counts always produce valid Shannon-inequality bounds, nor any argument that the tested cases realize the worst-case scaling. If the simulations miss families where the gap is larger, the 3/5 figure is an artifact rather than a general improvement. The abstract gives no error analysis or sensitivity checks, so the tightness claim stays observational. This is for people working on cardinality estimation in databases who already know the Abo Khamis framework and want to see the next graph family tried out. A practitioner might extract the caterpillar construction and the reported numbers to test in their own system. A theorist looking for a theorem will find the paper thin. It is worth sending to referees because the generalization is a direct, natural extension and the empirical results are presented in a form that can be checked or refuted with more data. Referees will likely press for either a proof sketch or a clearer statement of what the simulations actually establish.

Referee Report

2 major / 1 minor

Summary. The manuscript extends the information-theoretic framework of Abo Khamis et al. for upper-bounding database join sizes. It generalizes the use of star homomorphism counts to bi-stars and then to caterpillars (inspired by Sidorenko's conjecture) to bound local entropies via Shannon inequalities, yielding an upper bound on the join size exp(H(X_1,...,X_n)). Simulations are used to claim that the caterpillar bounds overestimate join size by ~m^{3/5} (versus m for stars and m^{3/4} for bi-stars), with exponents fitted via log-log regression (R² > 0.98). All relevant homomorphism counts are computable in linear time.

Significance. If the empirical exponents are representative of worst-case asymptotics and the entropy-bounding step is made rigorous, the caterpillar generalization could yield practically tighter pessimistic estimates for query optimization. The linear-time counting property is a concrete algorithmic strength that would be immediately usable. The Sidorenko-inspired choice of graphs is a natural and potentially extensible direction, though the current reliance on simulation rather than derivation or proof limits the theoretical contribution.

major comments (2)

[Abstract] Abstract and simulation results: the central claim that the caterpillar bound overestimates by m^{3/5} rests on log-log regression of simulated join sizes versus bounds. No argument is supplied that the generated instances realize the supremum scaling of (actual join size / caterpillar upper bound); if they do not, the fitted exponent is an artifact of the test distribution rather than a proven improvement over the bi-star case.
[Framework description] Framework (steps 3-4): the manuscript asserts that caterpillar homomorphism counts can be substituted for the local entropies H(X_i,...) via Shannon inequalities. No derivation or set of inequalities is exhibited showing that this substitution is valid for arbitrary tables and that it does not introduce looseness beyond what is already present in the star/bi-star cases.

minor comments (1)

The abstract states that 'all homomorphisms are counted in time linear in the size of the tables'; a brief complexity argument or reference to the counting algorithm should appear in the main text to substantiate this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below, clarifying the scope of our contributions and indicating revisions to improve rigor and presentation.

read point-by-point responses

Referee: [Abstract] Abstract and simulation results: the central claim that the caterpillar bound overestimates by m^{3/5} rests on log-log regression of simulated join sizes versus bounds. No argument is supplied that the generated instances realize the supremum scaling of (actual join size / caterpillar upper bound); if they do not, the fitted exponent is an artifact of the test distribution rather than a proven improvement over the bi-star case.

Authors: We agree that the reported exponents are empirical observations from our simulation suite rather than a proof that the instances attain the worst-case supremum scaling. The test distributions were designed to cover varied join cardinalities and selectivities, yielding consistent log-log fits with R² > 0.98, but we do not claim these realize the asymptotic supremum. We will revise the abstract and simulation section to state that the caterpillar bound overestimates join size by approximately m^{3/5} in the evaluated instances (versus m for stars and m^{3/4} for bi-stars), and we will add a paragraph discussing the limitations of simulation-based exponent estimation together with open questions for theoretical confirmation. revision: yes
Referee: [Framework description] Framework (steps 3-4): the manuscript asserts that caterpillar homomorphism counts can be substituted for the local entropies H(X_i,...) via Shannon inequalities. No derivation or set of inequalities is exhibited showing that this substitution is valid for arbitrary tables and that it does not introduce looseness beyond what is already present in the star/bi-star cases.

Authors: The substitution is justified by the same Shannon-inequality decomposition used for stars and bi-stars, with the caterpillar homomorphism count supplying the local entropy upper bound via the relation H(local variables) ≤ log(homomorphism count) plus conditional terms that follow the caterpillar's path structure. This does not introduce extra looseness because the caterpillar is a tree and inherits the tight bounding properties from the bi-star case. We acknowledge that the current manuscript states the substitution without exhibiting the explicit chain of inequalities. In revision we will add a short subsection (or appendix) that lists the precise Shannon inequalities for an arbitrary caterpillar and proves that the resulting bound is at least as tight as the bi-star bound for the same local variables. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation begins with the external Abo Khamis et al. information-theoretic framework (local entropies bounded via Shannon inequalities) and extends the graph class from stars to caterpillars, drawing inspiration from the external Sidorenko conjecture rather than re-deriving prior results. Homomorphism counts are used directly to bound the local entropies, with linear-time computation stated as a separate algorithmic claim. The reported overestimation exponents (m, m^{3/4}, m^{3/5}) are empirical observations obtained by log-log regression on simulations; they are not presented as theoretical predictions derived from the bound itself. The self-citation to the authors' prior bi-star work is incidental and not load-bearing for the caterpillar generalization or the simulation results. No equation or claim reduces by construction to its own inputs, fitted parameters, or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the standard Shannon-type inequalities for entropy decomposition and the assumption that caterpillar homomorphism counts provide valid upper bounds on the chosen local entropies; no free parameters or invented entities are visible in the abstract.

axioms (2)

standard math Shannon-type inequalities suffice to upper-bound joint entropy from local entropies
Step 3 of the framework explicitly invokes them.
domain assumption Homomorphism counts from caterpillars to tables upper-bound the required local entropies
Core of the new contribution; validity not shown in abstract.

pith-pipeline@v0.9.0 · 5592 in / 1317 out tokens · 57119 ms · 2026-05-10T10:52:41.360233+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages

[1]

How good are query optimizers, really?

V . Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neu- mann, “How good are query optimizers, really?”Proceedings of the VLDB Endowment, vol. 9, no. 3, pp. 204–215, Nov. 2015. 2https://snap.stanford.edu/data/

2015
[2]

Cardinality esti- mation in DBMS: A comprehensive benchmark evaluation,

Y . Han, Z. Wu, P. Wu, R. Zhu, J. Yang, L. W. Tan, K. Zeng, G. Cong, Y . Qin, A. Pfadler, Z. Qian, J. Zhou, J. Li, and B. Cui, “Cardinality esti- mation in DBMS: A comprehensive benchmark evaluation,”Proceedings of the VLDB Endowment, vol. 15, no. 4, pp. 752–765, Dec. 2021

2021
[3]

In-Memory Subgraph Matching: An In-depth Study,

S. Sun and Q. Luo, “In-Memory Subgraph Matching: An In-depth Study,” inProceedings of the 2020 ACM SIGMOD International Con- ference on Management of Data. Portland OR USA: ACM, Jun. 2020, pp. 1083–1098

2020
[4]

G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,

Y . Park, S. Ko, S. S. Bhowmick, K. Kim, K. Hong, and W.-S. Han, “G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,” inProceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Portland OR USA: ACM, Jun. 2020, pp. 1099–1114

2020
[5]

A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration,

H. Lan, Z. Bao, and Y . Peng, “A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration,” Data Science and Engineering, vol. 6, no. 1, pp. 86–101, Mar. 2021

2021
[6]

Data dependencies for query optimization: A survey,

J. Kossmann, T. Papenbrock, and F. Naumann, “Data dependencies for query optimization: A survey,”The VLDB Journal, vol. 31, no. 1, pp. 1–22, Jan. 2022

2022
[7]

Size Bounds and Query Plans for Relational Joins,

A. Atserias, M. Grohe, and D. Marx, “Size Bounds and Query Plans for Relational Joins,”SIAM Journal on Computing, vol. 42, no. 4, pp. 1737–1767, Jan. 2013

2013
[8]

Skew strikes back: New developments in the theory of join algorithms,

H. Q. Ngo, C. R ´e, and A. Rudra, “Skew strikes back: New developments in the theory of join algorithms,”ACM SIGMOD Record, vol. 42, no. 4, pp. 5–16, Feb. 2014

2014
[9]

Triejoin: A Simple, Worst-Case Optimal Join Algo- rithm,

T. Veldhuizen, “Triejoin: A Simple, Worst-Case Optimal Join Algo- rithm,” 2014

2014
[10]

Worst-case Optimal Join Algorithms,

H. Q. Ngo, E. Porat, C. R ´e, and A. Rudra, “Worst-case Optimal Join Algorithms,”Journal of the ACM, vol. 65, no. 3, pp. 1–40, Jun. 2018

2018
[11]

Computing Join Queries with Functional Dependencies,

M. Abo Khamis, H. Q. Ngo, and D. Suciu, “Computing Join Queries with Functional Dependencies,” inProceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Sys- tems. San Francisco California USA: ACM, Jun. 2016, pp. 327–342

2016
[12]

What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?

——, “What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?” inProceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. Chicago Illinois USA: ACM, May 2017, pp. 429– 444

2017
[13]

Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs,

J. Chen, Y . Huang, M. Wang, S. Salihoglu, and K. Salem, “Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs,”Proceedings of the VLDB Endowment, vol. 15, no. 8, pp. 1533–1545, Apr. 2022

2022
[14]

Pessimistic Cardinality Estima- tion: Tighter Upper Bounds for Intermediate Join Cardinalities,

W. Cai, M. Balazinska, and D. Suciu, “Pessimistic Cardinality Estima- tion: Tighter Upper Bounds for Intermediate Join Cardinalities,” 2019

2019
[15]

Simplicity Done Right for Join Ordering,

A. Hertzschuch, C. Hartmann, D. Habich, and W. Lehner, “Simplicity Done Right for Join Ordering,”CIDR, 2021

2021
[16]

Join Size Bounds using l p -Norms on Degree Sequences,

M. Abo Khamis, V . Nakos, D. Olteanu, and D. Suciu, “Join Size Bounds using l p -Norms on Degree Sequences,”Proceedings of the ACM on Management of Data, vol. 2, no. 2, pp. 1–24, May 2024

2024
[17]

Degree Sequence Bound For Join Cardinality Estimation,

K. Deeds, D. Suciu, M. Balazinska, and W. Cai, “Degree Sequence Bound For Join Cardinality Estimation,” Mar. 2022

2022
[18]

SafeBound: A Practical System for Generating Cardinality Bounds,

K. B. Deeds, D. Suciu, and M. Balazinska, “SafeBound: A Practical System for Generating Cardinality Bounds,”Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–26, May 2023

2023
[19]

Degree Sequence Bounds,

K. Deeds, D. Suciu, M. Balazinska, and W. Cai, “Degree Sequence Bounds,”ACM Transactions on Database Systems, p. 3716378, May 2025

2025
[20]

LPBOUND: Pessimistic Cardinality Estimation Usingℓ p -Norms of Degree Sequences,

H. Zhang, C. Mayer, M. Abo Khamis, D. Olteanu, and D. Suciu, “LPBOUND: Pessimistic Cardinality Estimation Usingℓ p -Norms of Degree Sequences,”Proceedings of the ACM on Management of Data, vol. 3, no. 3, pp. 1–27, Jun. 2025

2025
[21]

LP Bound in Action: Cardinality Estimation with One-Sided Guarantees,

C. Mayer, H. Zhang, M. Abo Khamis, D. Olteanu, and D. Suciu, “LP Bound in Action: Cardinality Estimation with One-Sided Guarantees,” inCompanion of the 2025 International Conference on Management of Data. Berlin Germany: ACM, Jun. 2025, pp. 187–190

2025
[22]

Ambidextrous degree sequence bounds for pessimistic cardinality estimation,

Y .-T. Lin and H.-P. Wang, “Ambidextrous degree sequence bounds for pessimistic cardinality estimation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.04249

work page arXiv 2025
[23]

An information theoretic approach to sidorenko’s conjecture,

B. Szegedy, “An information theoretic approach to sidorenko’s conjecture,” 2015. [Online]. Available: https://arxiv.org/abs/1406.6738 APPENDIXA REALCATERPILLARMOMENTS ANDCONVEXITY Theorem 6 in Section V is limted to integer parameters because caterpillars have whole numbers of leaves. Never- theless, the entropy bounds themselves extend to nonnegative rea...

work page arXiv 2015