pith. machine review for the scientific record. sign in

arxiv: 2604.14647 · v1 · submitted 2026-04-16 · 💻 cs.IT · math.IT

Recognition: unknown

Sidorenko-Inspired Pessimistic Estimation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 10:52 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords database joinsinformation theoryhomomorphism countingcaterpillar graphsSidorenko conjecturequery optimizationentropy boundssize estimation
0
0 comments X

The pith

Caterpillar homomorphism counts tighten upper bounds on database join sizes to an m to the 3/5 overestimation factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends an information-theoretic framework for upper-bounding the cardinality of relational joins by replacing star and bi-star graphs with caterpillar graphs when counting homomorphisms into the input tables. This generalization draws from Sidorenko's conjecture to produce a larger family of local entropy bounds that still rely on Shannon inequalities. Simulations on random instances show the new bounds overestimate the true join size by roughly m to the 3/5 power, compared with m for stars and m to the 3/4 for bi-stars, with log-log regressions achieving R-square values above 0.98. The bounds remain computable in linear time in the size of the tables. A sympathetic reader cares because more accurate pessimistic size estimates directly improve the choice of execution plans in database query optimizers.

Core claim

By counting homomorphisms from caterpillar graphs rather than stars or bi-stars, the pessimistic estimation method obtains join-size upper bounds whose growth rate with input size m improves to an empirical exponent of 3/5, as measured by linear regression on log-log plots of overestimation factors across simulated instances.

What carries the argument

Caterpillar graphs (trees in which every vertex lies at distance at most one from a central path), whose homomorphism counts into the input relations supply the local entropy upper bounds that are combined via Shannon inequalities to bound the joint entropy of a uniform random join row.

If this is right

  • The caterpillar bounds are computable in time linear in the total size of the input tables.
  • The same linear-time homomorphism counting works for any fixed caterpillar and any number of input relations.
  • The framework unifies star, bi-star, and caterpillar bounds as special cases of a single graph-homomorphism approach.
  • The observed improvement in exponent directly reduces the overestimation factor for large m in query optimizers that adopt the bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the 3/5 exponent holds in the worst case, then further Sidorenko-satisfying graphs could produce still tighter exponents.
  • The method may extend to bounding the number of answers to more general conjunctive queries beyond simple joins.
  • Connections to extremal graph theory suggest that resolving parts of Sidorenko's conjecture could yield provably optimal entropy bounds for this database setting.

Load-bearing premise

The log-log regression on simulated random instances accurately captures the true worst-case asymptotic overestimation exponent of the caterpillar bound.

What would settle it

A family of join instances with growing size m for which the ratio of the caterpillar bound to the actual join size exhibits a log-log slope strictly larger than 3/5.

Figures

Figures reproduced from arXiv: 2604.14647 by Hsin-Po Wang, Yu-Ting Lin.

Figure 1
Figure 1. Figure 1: The horizontal axis is the actual number of homomorphisms; the [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

Recently, Abo Khamis et al. showed how to upper bound the size of a join of multiple tables, a problem essential to query optimization in database theory. They unified earlier works by the following information-theoretical framework. 1. Let $(X_1,..., X_n)$ be a row selected from the join uniformly at random. 2. The size of the join is now $\exp(H(X_1,..., X_n))$. 3. To upper bound $H(X_1,..., X_n)$, break it into several $\textit{local entropies}$, such as $H(X_1)$, $H(X_2, X_3)$, and $H(X_4|X_5)$, using Shannon-type inequalities. 4. Upper bound local entropies using statistics of the tables being joined. The statistics Abo Khamis et al. considered are the counts of graph homomorphisms from stars to the tables. In a follow-up work, we generalized stars to bi-stars. In this paper, we generalize bi-stars to caterpillars, an even larger class of graphs inspired by Sidorenko's conjecture. Simulations show that, while Abo Khamis et al.'s star bound overestimates the join size by $m$, our bi-star bound overestimates by about $m^{3/4}$, and this paper's new caterpillar bound overestimates by about $m^{3/5}$. These exponents are obtained by log-log regressions with R-square $> 0.98$. All homomorphisms are counted in time linear in the size of the tables being joined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript extends the information-theoretic framework of Abo Khamis et al. for upper-bounding database join sizes. It generalizes the use of star homomorphism counts to bi-stars and then to caterpillars (inspired by Sidorenko's conjecture) to bound local entropies via Shannon inequalities, yielding an upper bound on the join size exp(H(X_1,...,X_n)). Simulations are used to claim that the caterpillar bounds overestimate join size by ~m^{3/5} (versus m for stars and m^{3/4} for bi-stars), with exponents fitted via log-log regression (R² > 0.98). All relevant homomorphism counts are computable in linear time.

Significance. If the empirical exponents are representative of worst-case asymptotics and the entropy-bounding step is made rigorous, the caterpillar generalization could yield practically tighter pessimistic estimates for query optimization. The linear-time counting property is a concrete algorithmic strength that would be immediately usable. The Sidorenko-inspired choice of graphs is a natural and potentially extensible direction, though the current reliance on simulation rather than derivation or proof limits the theoretical contribution.

major comments (2)
  1. [Abstract] Abstract and simulation results: the central claim that the caterpillar bound overestimates by m^{3/5} rests on log-log regression of simulated join sizes versus bounds. No argument is supplied that the generated instances realize the supremum scaling of (actual join size / caterpillar upper bound); if they do not, the fitted exponent is an artifact of the test distribution rather than a proven improvement over the bi-star case.
  2. [Framework description] Framework (steps 3-4): the manuscript asserts that caterpillar homomorphism counts can be substituted for the local entropies H(X_i,...) via Shannon inequalities. No derivation or set of inequalities is exhibited showing that this substitution is valid for arbitrary tables and that it does not introduce looseness beyond what is already present in the star/bi-star cases.
minor comments (1)
  1. The abstract states that 'all homomorphisms are counted in time linear in the size of the tables'; a brief complexity argument or reference to the counting algorithm should appear in the main text to substantiate this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below, clarifying the scope of our contributions and indicating revisions to improve rigor and presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract and simulation results: the central claim that the caterpillar bound overestimates by m^{3/5} rests on log-log regression of simulated join sizes versus bounds. No argument is supplied that the generated instances realize the supremum scaling of (actual join size / caterpillar upper bound); if they do not, the fitted exponent is an artifact of the test distribution rather than a proven improvement over the bi-star case.

    Authors: We agree that the reported exponents are empirical observations from our simulation suite rather than a proof that the instances attain the worst-case supremum scaling. The test distributions were designed to cover varied join cardinalities and selectivities, yielding consistent log-log fits with R² > 0.98, but we do not claim these realize the asymptotic supremum. We will revise the abstract and simulation section to state that the caterpillar bound overestimates join size by approximately m^{3/5} in the evaluated instances (versus m for stars and m^{3/4} for bi-stars), and we will add a paragraph discussing the limitations of simulation-based exponent estimation together with open questions for theoretical confirmation. revision: yes

  2. Referee: [Framework description] Framework (steps 3-4): the manuscript asserts that caterpillar homomorphism counts can be substituted for the local entropies H(X_i,...) via Shannon inequalities. No derivation or set of inequalities is exhibited showing that this substitution is valid for arbitrary tables and that it does not introduce looseness beyond what is already present in the star/bi-star cases.

    Authors: The substitution is justified by the same Shannon-inequality decomposition used for stars and bi-stars, with the caterpillar homomorphism count supplying the local entropy upper bound via the relation H(local variables) ≤ log(homomorphism count) plus conditional terms that follow the caterpillar's path structure. This does not introduce extra looseness because the caterpillar is a tree and inherits the tight bounding properties from the bi-star case. We acknowledge that the current manuscript states the substitution without exhibiting the explicit chain of inequalities. In revision we will add a short subsection (or appendix) that lists the precise Shannon inequalities for an arbitrary caterpillar and proves that the resulting bound is at least as tight as the bi-star bound for the same local variables. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation begins with the external Abo Khamis et al. information-theoretic framework (local entropies bounded via Shannon inequalities) and extends the graph class from stars to caterpillars, drawing inspiration from the external Sidorenko conjecture rather than re-deriving prior results. Homomorphism counts are used directly to bound the local entropies, with linear-time computation stated as a separate algorithmic claim. The reported overestimation exponents (m, m^{3/4}, m^{3/5}) are empirical observations obtained by log-log regression on simulations; they are not presented as theoretical predictions derived from the bound itself. The self-citation to the authors' prior bi-star work is incidental and not load-bearing for the caterpillar generalization or the simulation results. No equation or claim reduces by construction to its own inputs, fitted parameters, or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the standard Shannon-type inequalities for entropy decomposition and the assumption that caterpillar homomorphism counts provide valid upper bounds on the chosen local entropies; no free parameters or invented entities are visible in the abstract.

axioms (2)
  • standard math Shannon-type inequalities suffice to upper-bound joint entropy from local entropies
    Step 3 of the framework explicitly invokes them.
  • domain assumption Homomorphism counts from caterpillars to tables upper-bound the required local entropies
    Core of the new contribution; validity not shown in abstract.

pith-pipeline@v0.9.0 · 5592 in / 1317 out tokens · 57119 ms · 2026-05-10T10:52:41.360233+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages

  1. [1]

    How good are query optimizers, really?

    V . Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neu- mann, “How good are query optimizers, really?”Proceedings of the VLDB Endowment, vol. 9, no. 3, pp. 204–215, Nov. 2015. 2https://snap.stanford.edu/data/

  2. [2]

    Cardinality esti- mation in DBMS: A comprehensive benchmark evaluation,

    Y . Han, Z. Wu, P. Wu, R. Zhu, J. Yang, L. W. Tan, K. Zeng, G. Cong, Y . Qin, A. Pfadler, Z. Qian, J. Zhou, J. Li, and B. Cui, “Cardinality esti- mation in DBMS: A comprehensive benchmark evaluation,”Proceedings of the VLDB Endowment, vol. 15, no. 4, pp. 752–765, Dec. 2021

  3. [3]

    In-Memory Subgraph Matching: An In-depth Study,

    S. Sun and Q. Luo, “In-Memory Subgraph Matching: An In-depth Study,” inProceedings of the 2020 ACM SIGMOD International Con- ference on Management of Data. Portland OR USA: ACM, Jun. 2020, pp. 1083–1098

  4. [4]

    G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,

    Y . Park, S. Ko, S. S. Bhowmick, K. Kim, K. Hong, and W.-S. Han, “G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,” inProceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Portland OR USA: ACM, Jun. 2020, pp. 1099–1114

  5. [5]

    A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration,

    H. Lan, Z. Bao, and Y . Peng, “A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration,” Data Science and Engineering, vol. 6, no. 1, pp. 86–101, Mar. 2021

  6. [6]

    Data dependencies for query optimization: A survey,

    J. Kossmann, T. Papenbrock, and F. Naumann, “Data dependencies for query optimization: A survey,”The VLDB Journal, vol. 31, no. 1, pp. 1–22, Jan. 2022

  7. [7]

    Size Bounds and Query Plans for Relational Joins,

    A. Atserias, M. Grohe, and D. Marx, “Size Bounds and Query Plans for Relational Joins,”SIAM Journal on Computing, vol. 42, no. 4, pp. 1737–1767, Jan. 2013

  8. [8]

    Skew strikes back: New developments in the theory of join algorithms,

    H. Q. Ngo, C. R ´e, and A. Rudra, “Skew strikes back: New developments in the theory of join algorithms,”ACM SIGMOD Record, vol. 42, no. 4, pp. 5–16, Feb. 2014

  9. [9]

    Triejoin: A Simple, Worst-Case Optimal Join Algo- rithm,

    T. Veldhuizen, “Triejoin: A Simple, Worst-Case Optimal Join Algo- rithm,” 2014

  10. [10]

    Worst-case Optimal Join Algorithms,

    H. Q. Ngo, E. Porat, C. R ´e, and A. Rudra, “Worst-case Optimal Join Algorithms,”Journal of the ACM, vol. 65, no. 3, pp. 1–40, Jun. 2018

  11. [11]

    Computing Join Queries with Functional Dependencies,

    M. Abo Khamis, H. Q. Ngo, and D. Suciu, “Computing Join Queries with Functional Dependencies,” inProceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Sys- tems. San Francisco California USA: ACM, Jun. 2016, pp. 327–342

  12. [12]

    What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?

    ——, “What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?” inProceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. Chicago Illinois USA: ACM, May 2017, pp. 429– 444

  13. [13]

    Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs,

    J. Chen, Y . Huang, M. Wang, S. Salihoglu, and K. Salem, “Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs,”Proceedings of the VLDB Endowment, vol. 15, no. 8, pp. 1533–1545, Apr. 2022

  14. [14]

    Pessimistic Cardinality Estima- tion: Tighter Upper Bounds for Intermediate Join Cardinalities,

    W. Cai, M. Balazinska, and D. Suciu, “Pessimistic Cardinality Estima- tion: Tighter Upper Bounds for Intermediate Join Cardinalities,” 2019

  15. [15]

    Simplicity Done Right for Join Ordering,

    A. Hertzschuch, C. Hartmann, D. Habich, and W. Lehner, “Simplicity Done Right for Join Ordering,”CIDR, 2021

  16. [16]

    Join Size Bounds using l p -Norms on Degree Sequences,

    M. Abo Khamis, V . Nakos, D. Olteanu, and D. Suciu, “Join Size Bounds using l p -Norms on Degree Sequences,”Proceedings of the ACM on Management of Data, vol. 2, no. 2, pp. 1–24, May 2024

  17. [17]

    Degree Sequence Bound For Join Cardinality Estimation,

    K. Deeds, D. Suciu, M. Balazinska, and W. Cai, “Degree Sequence Bound For Join Cardinality Estimation,” Mar. 2022

  18. [18]

    SafeBound: A Practical System for Generating Cardinality Bounds,

    K. B. Deeds, D. Suciu, and M. Balazinska, “SafeBound: A Practical System for Generating Cardinality Bounds,”Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–26, May 2023

  19. [19]

    Degree Sequence Bounds,

    K. Deeds, D. Suciu, M. Balazinska, and W. Cai, “Degree Sequence Bounds,”ACM Transactions on Database Systems, p. 3716378, May 2025

  20. [20]

    LPBOUND: Pessimistic Cardinality Estimation Usingℓ p -Norms of Degree Sequences,

    H. Zhang, C. Mayer, M. Abo Khamis, D. Olteanu, and D. Suciu, “LPBOUND: Pessimistic Cardinality Estimation Usingℓ p -Norms of Degree Sequences,”Proceedings of the ACM on Management of Data, vol. 3, no. 3, pp. 1–27, Jun. 2025

  21. [21]

    LP Bound in Action: Cardinality Estimation with One-Sided Guarantees,

    C. Mayer, H. Zhang, M. Abo Khamis, D. Olteanu, and D. Suciu, “LP Bound in Action: Cardinality Estimation with One-Sided Guarantees,” inCompanion of the 2025 International Conference on Management of Data. Berlin Germany: ACM, Jun. 2025, pp. 187–190

  22. [22]

    Ambidextrous degree sequence bounds for pessimistic cardinality estimation,

    Y .-T. Lin and H.-P. Wang, “Ambidextrous degree sequence bounds for pessimistic cardinality estimation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.04249

  23. [23]

    An information theoretic approach to sidorenko’s conjecture,

    B. Szegedy, “An information theoretic approach to sidorenko’s conjecture,” 2015. [Online]. Available: https://arxiv.org/abs/1406.6738 APPENDIXA REALCATERPILLARMOMENTS ANDCONVEXITY Theorem 6 in Section V is limted to integer parameters because caterpillars have whole numbers of leaves. Never- theless, the entropy bounds themselves extend to nonnegative rea...