Query Optimization and Evaluation via Information Theory: A Tutorial
Pith reviewed 2026-05-10 19:30 UTC · model grok-4.3
The pith
The PANDA framework derives conjunctive query plans directly from proofs of information-theoretic bounds on intermediate cardinalities to match specialized algorithm runtimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing information-theoretically tight upper bounds on the cardinalities of intermediate relations and constructing query plans directly from the mathematical proofs of those bounds, the PANDA framework yields a generic algorithm for conjunctive query evaluation whose performance matches or subsumes that of specialized algorithms for the same problems.
What carries the argument
The PANDA framework, which translates information-theoretically tight cardinality upper bounds into executable query plans by following the structure of their proofs.
If this is right
- A single optimizer can achieve asymptotically optimal runtimes for many different conjunctive queries without requiring separate code for each problem class.
- Query plans can be generated automatically by mirroring the steps in the proof of a cardinality bound.
- Techniques from database theory, constraint satisfaction, and graph algorithms become interchangeable under the same bounding method.
- Performance can equal or exceed that of algorithms relying on fast matrix multiplication for suitable queries.
Where Pith is reading between the lines
- Database systems could adopt bound-derivation techniques to improve performance on mixed workloads without maintaining many specialized code paths.
- The same proof-to-plan coupling might be tested on query classes beyond conjunctive queries, such as those with aggregates or disjunctions.
- Practical implementations would need to measure the exact overhead of bound computation on real data to confirm the claimed advantage holds outside theory.
Load-bearing premise
The information-theoretically tight upper bounds on intermediate cardinalities can be computed and turned into executable plans with overhead low enough to preserve the asymptotic runtime advantage.
What would settle it
A concrete conjunctive query together with input statistics for which a PANDA-derived plan runs asymptotically slower than a published specialized algorithm, or for which computing the bounds takes longer than the evaluation savings they provide.
Figures
read the original abstract
Database theory is exciting because it studies highly general and practically useful abstractions. Conjunctive query (CQ) evaluation is a prime example: it simultaneously generalizes graph pattern matching, constraint satisfaction, and statistical inference, among others. This generality is both the strength and the central challenge of the field. The query optimization and evaluation problem is fundamentally a "meta-algorithm" problem: given a query $Q$ and statistics $\cal S$ about the input database, how should one best answer $Q$? Because the problem is so general, it is often impossible for such a meta-algorithm to match the runtimes of specialized algorithms designed for a fixed query -- or so it seemed. The past fifteen years have witnessed an exciting development in database theory: a general framework, called PANDA, that emerged from advances in database theory, constraint satisfaction problems (CSP), and graph algorithms, for evaluating conjunctive queries given input data statistics. The key idea is to derive information-theoretically tight upper bounds on the cardinalities of intermediate relations produced during query evaluation. These bounds determine the costs of query plans, and crucially, the query plans themselves are derived directly from the mathematical proof of the upper bound. This tight coupling of proof and algorithm is what makes PANDA both principled and powerful. Remarkably, this generic algorithm matches -- and in some cases subsumes -- the runtimes of specialized algorithms for the same problems, including algorithms that exploit fast matrix multiplication. This paper is a tutorial on the PANDA framework. We illustrate the key ideas through concrete examples, conveying the main intuitions behind the theory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a tutorial on the PANDA framework for conjunctive query optimization and evaluation. It describes how information-theoretically tight upper bounds on the cardinalities of intermediate relations are derived from entropy inequalities and used to construct query plans directly from the mathematical proofs of these bounds. The tutorial claims that this approach allows a generic algorithm to match or subsume the runtimes of specialized algorithms for the same problems, including those that use fast matrix multiplication, and illustrates the ideas with concrete examples.
Significance. If the tutorial accurately conveys the PANDA framework's ability to achieve tight bounds and efficient plans, it could be significant for the database theory community by making advanced results from the intersection of database theory, CSP, and graph algorithms more accessible. The tight coupling of proofs to algorithms is a notable strength that could inspire new query optimization techniques.
major comments (2)
- [Discussion of runtime matching and PANDA framework] The central claim that PANDA matches the runtimes of specialized algorithms (including fast matrix multiplication) is load-bearing, but the tutorial does not address the computational overhead of computing the optimal information-theoretic bounds (via linear programs over entropy variables, which can be exponential in size for general CQs) or of translating the proof structure into an executable plan. This overhead must be shown to be asymptotically negligible to substantiate the claim.
- [Examples illustrating the key ideas] In the concrete examples used to illustrate the key ideas, include explicit calculations of the bound derivation (e.g., the LP or entropy inequalities used) and the resulting plan, along with a comparison to the runtime of the specialized algorithm it is claimed to match.
minor comments (2)
- [Notation and definitions] Ensure consistent use of notation for statistics S and query Q throughout the tutorial.
- [References] Add references to the original PANDA papers for readers who wish to delve deeper into the proofs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive view of the tutorial's potential impact. We address the major comments point by point below and will revise the manuscript accordingly to strengthen the presentation.
read point-by-point responses
-
Referee: The central claim that PANDA matches the runtimes of specialized algorithms (including fast matrix multiplication) is load-bearing, but the tutorial does not address the computational overhead of computing the optimal information-theoretic bounds (via linear programs over entropy variables, which can be exponential in size for general CQs) or of translating the proof structure into an executable plan. This overhead must be shown to be asymptotically negligible to substantiate the claim.
Authors: We agree that an explicit discussion of overhead is needed to fully substantiate the runtime claims. In the PANDA framework the entropy LP is solved once per query; under standard data-complexity assumptions (fixed query) its size is constant and therefore asymptotically negligible relative to input size. Plan construction follows the proof structure and runs in time polynomial in the (query-dependent) proof size. We will add a dedicated subsection on these complexity considerations, clarifying that the matching runtimes refer to the data-dependent evaluation phase and that the overhead is acceptable for the fixed-query setting in which specialized algorithms are also analyzed. revision: yes
-
Referee: In the concrete examples used to illustrate the key ideas, include explicit calculations of the bound derivation (e.g., the LP or entropy inequalities used) and the resulting plan, along with a comparison to the runtime of the specialized algorithm it is claimed to match.
Authors: We appreciate this suggestion for making the tutorial more self-contained. The current examples were kept concise to convey intuition, but we will expand them to show the explicit entropy inequalities (or the corresponding LP), the query plan derived directly from the proof, and a side-by-side runtime comparison with the specialized algorithm (including fast-matrix-multiplication baselines). These additions will be placed in the relevant example sections without altering the overall tutorial structure. revision: yes
Circularity Check
No circularity: tutorial presents prior PANDA results without self-referential derivations
full rationale
The paper is explicitly an expository tutorial on the established PANDA framework from prior publications. It describes the derivation of information-theoretic cardinality bounds and their translation into query plans, but introduces no new equations, predictions, or first-principles results within this text that reduce to fitted parameters, self-definitions, or self-citation chains by construction. Central claims about matching specialized runtimes (including fast matrix multiplication) are attributed to the framework's existing properties rather than being re-derived here in a tautological manner. No load-bearing step equates output to input via the paper's own equations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Information-theoretically tight upper bounds on intermediate cardinalities exist and can be derived for conjunctive queries given input statistics.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The key idea is to derive information-theoretically tight upper bounds on the cardinalities of intermediate relations... turning each proof step into an atomic relational operator
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
submodular width... Shannon-flow inequalities
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
M. Abo Khamis and H. Chen,Jaguar: A Primal Algorithm for Conjunctive Query Evaluation in Submodular-Width Time, Proc. ACM Manag. Data, (2026)
work page 2026
-
[2]
M. Abo Khamis, R. R. Curtin, B. Moseley, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich,Functional aggregate queries with additive inequalities, ACM Trans. Database Syst., 45 (2020)
work page 2020
-
[3]
M. Abo Khamis, V. Nakos, D. Olteanu, and D. Suciu,Join size bounds using lp-norms on degree sequences, Proc. ACM Manag. Data, 2 (2024)
work page 2024
-
[4]
M. Abo Khamis, H. Q. Ngo, and A. Rudra,FAQ: questions asked frequently, in Proceedings of the 35th ACM PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 13–28
work page 2016
-
[5]
M. Abo Khamis, H. Q. Ngo, and D. Suciu,Computing join queries with functional dependencies, in Proceedings of the 35th ACM PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 327–342
work page 2016
-
[6]
M. Abo Khamis, H. Q. Ngo, and D. Suciu,What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?, in Proceedings of the 36th ACM PODS 2017, Chicago, IL, USA, May 14-19, 2017, ACM, 2017, pp. 429–444. Extended version available athttp://arxiv.org/abs/1612.02503
-
[7]
M. Abo Khamis, H. Q. Ngo, and D. Suciu,PANDA: Query Evaluation in Submodular Width, TheoretiCS, Volume 4 (2025)
work page 2025
-
[8]
M. Abo Khamis, H. Q. Ngo, and D. Suciu,PANDAExpress: a Simpler and Faster PANDA Algorithm, Proc. ACM Manag. Data, (2026). 14We also need to replace the maximum in Eq. (42) by the supremum since the set of entropic functions is not closed. 15In general, we cannot perform FMM over the ( min, +) semiring because there is no subtraction. Also, full queries ca...
work page 2026
-
[9]
Alon,On the number of subgraphs of prescribed type of graphs with a given number of edges, Israel J
N. Alon,On the number of subgraphs of prescribed type of graphs with a given number of edges, Israel J. Math., 38 (1981), pp. 116–130. [11]N. Alon, R. Yuster, and U. Zwick,Finding and counting given length cycles, Algorithmica, 17 (1997), pp. 209–223
work page 1981
-
[10]
A. Atserias, M. Grohe, and D. Marx,Size bounds and query plans for relational joins, SIAM J. Comput., 42 (2013), pp. 1737–1767
work page 2013
-
[11]
G. Bagan, A. Durand, and E. Grandjean,On acyclic conjunctive queries and constant delay enumeration, in Computer Science Logic, 21st International Workshop, CSL 2007, 16th Annual Conference of the EACSL, Lausanne, Switzerland, September 11-15, 2007, Proceedings, vol. 4646 of Lecture Notes in Computer Science, Springer, 2007, pp. 208–222
work page 2007
-
[12]
C. Berkholz and H. Vinall-Smeeth,Factorised Representations of Join Queries: Tight Bounds and a New Dichotomy, arXiv e-prints, (2025), p. arXiv:2503.20438
-
[13]
K. Bringmann and E. Gorbachev,A fine-grained classification of subquadratic patterns for subgraph listing and friends, in Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC 2025, Prague, Czechia, June 23-27, 2025, ACM, 2025, pp. 2145–2156
work page 2025
-
[14]
T. H. Chan and R. W. Yeung,On a relation between information inequalities and group theory, IEEE Transactions on Information Theory, 48 (2002), pp. 1992–1995
work page 2002
-
[15]
A. K. Chandra and P. M. Merlin,Optimal implementation of conjunctive queries in relational data bases, in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4-6, 1977, Boulder, Colorado, USA, ACM, 1977, pp. 77–90
work page 1977
-
[16]
S. Chaudhuri,An overview of query optimization in relational systems, in Proceedings of the Seventeenth ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington, USA, ACM Press, 1998, pp. 34–43
work page 1998
-
[17]
F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer,Some intersection theorems for ordered sets and graphs, J. Combin. Theory Ser. A, 43 (1986), pp. 23–37
work page 1986
-
[18]
M. Dalirrooyfard, S. Mathialagan, V. V. Williams, and Y. Xu,Towards optimal output-sensitive clique listing or: Listing cliques from smaller cliques, in Proceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, New York, NY, USA, 2024, Association for Computing Machinery, p. 923–934
work page 2024
-
[19]
M. Dalirrooyfard, T. Vuong, and V. Vassilevska Williams,Graph pattern detection: Hardness for all induced patterns and faster noninduced cycles, SIAM J. Comput., 50 (2021), pp. 1627–1662. [22]R. Dechter,Bucket elimination: A unifying framework for reasoning, Artif. Intell., 113 (1999), pp. 41–85
work page 2021
-
[20]
,Bucket elimination: A unifying framework for reasoning, Artificial Intelligence, 113 (1999), pp. 41–85
work page 1999
-
[21]
K. Deeds and T. C. Merkl,Partition Constraints for Conjunctive Queries: Bounds and Worst-Case Optimal Joins, in 28th International Conference on Database Theory (ICDT 2025), vol. 328 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, 2025, Schloss Dagstuhl – Leibniz-Zentrum f¨ ur Informatik, pp. 17:1–17:18
work page 2025
-
[22]
S. Deep, H. Zhao, A. Z. Fan, and P. Koutris,Output-sensitive conjunctive query evaluation, Proc. ACM Manag. Data, 2 (2024)
work page 2024
-
[23]
B. Ding, V. R. Narasayya, and S. Chaudhuri,Extensible query optimizers in practice, Found. Trends Databases, 14 (2024), pp. 186–402
work page 2024
-
[24]
A. Durand and S. Mengel,Structural tractability of counting of solutions to conjunctive queries, in Proceedings of the 16th International Conference on Database Theory, ICDT ’13, New York, NY, USA, 2013, Association for Computing Machinery, p. 81–92
work page 2013
-
[25]
A. Z. Fan, P. Koutris, and H. Zhao,The Fine-Grained Complexity of Boolean Conjunctive Queries and Sum-Product Problems, in 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023), vol. 261 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, 2023, Schloss Dagstuhl – Leibniz-Zentrum f¨ ur Informatik, p...
work page 2023
- [26]
-
[27]
E. Friedgut and J. Kahn,On the number of copies of one hypergraph in another, Israel J. Math., 105 (1998), pp. 251–256
work page 1998
-
[28]
,On the number of copies of one hypergraph in another, Israel Journal of Mathematics, 105 (1998), pp. 251–256
work page 1998
-
[29]
G. Gottlob, G. Greco, N. Leone, and F. Scarcello,Hypertree decompositions: Questions and answers, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, ACM, 2016, pp. 57–74
work page 2016
-
[30]
G. Gottlob, S. T. Lee, and G. Valiant,Size and treewidth bounds for conjunctive queries, in Proceedings of the Twenty-Eigth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2009, June 19 - July 1, 2009, Providence, Rhode Island, USA, ACM, 2009, pp. 45–54
work page 2009
-
[31]
G. Gottlob, S. T. Lee, G. Valiant, and P. Valiant,Size and treewidth bounds for conjunctive queries, J. ACM, 59 (2012), pp. 16:1–16:35
work page 2012
-
[32]
G. Gottlob, N. Leone, and F. Scarcello,Hypertree decompositions and tractable queries, Journal of Computer and System Sciences, 64 (2002), pp. 579–627
work page 2002
-
[33]
T. J. Green, G. Karvounarakis, and V. Tannen,Provenance semirings, in Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’07, New York, NY, USA, 2007, Association for Computing Machinery, p. 31–40. 25
work page 2007
-
[34]
T. J. Green and V. Tannen,The semiring framework for database provenance, in Proceedings of the 36th ACM SIGMOD- SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, ACM, 2017, pp. 93–99
work page 2017
-
[35]
M. Grohe and D. Marx,Constraint solving via fractional edge covers, in Proceedings of the Seventeenth Annual ACM- SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, ACM Press, 2006, pp. 289–298
work page 2006
-
[36]
,Constraint solving via fractional edge covers, ACM Trans. Algorithms, 11 (2014), pp. 4:1–4:20. [41]X. Hu,Output-optimal algorithms for join-aggregate queries, Proc. ACM Manag. Data, 3 (2025)
work page 2014
-
[37]
S. Im, B. Moseley, H. Q. Ngo, and K. Pruhs,Efficient algorithms for cardinality estimation and conjunctive query evaluation with simple degree constraints, Proc. ACM Manag. Data, 3 (2025), pp. 96:1–96:26
work page 2025
-
[38]
A. Itai and M. Rodeh,Finding a minimum circuit in a graph, in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4-6, 1977, Boulder, Colorado, USA, ACM, 1977, pp. 1–10
work page 1977
-
[39]
M. A. Khamis, X. Hu, and D. Suciu,Fast matrix multiplication meets the submodular width, Proc. ACM Manag. Data, 3 (2025), pp. 98:1–98:26
work page 2025
-
[40]
M. A. Khamis, H. Q. Ngo, C. R ´e, and A. Rudra,Joins via geometric resolutions: Worst case and beyond, ACM Trans. Database Syst., 41 (2016)
work page 2016
-
[41]
P. G. Kolaitis, A. E. Romashchenko, M. Studen ´y, D. Suciu, and T. A. Boege,Algorithmic Aspects of Information Theory (Dagstuhl Seminar 22301), Dagstuhl Reports, 12 (2023), pp. 180–204
work page 2023
-
[42]
P. G. Kolaitis and M. Y. Vardi,Conjunctive-query containment and constraint satisfaction, in Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington, USA, ACM Press, 1998, pp. 205–213. [48]D. Koller and N. Friedman,Probabilistic Graphical Models - Principles and Techniques, M...
work page 1998
-
[43]
V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, and T. Neumann,How good are query optimizers, really?, Proc. VLDB Endow., 9 (2015), pp. 204–215. [50]D. Maier,Theory of Relational Databases, Computer Science Pr, 1983
work page 2015
-
[44]
Marx,Tractable Hypergraph Properties for Constraint Satisfaction and Conjunctive Queries, J
D. Marx,Tractable Hypergraph Properties for Constraint Satisfaction and Conjunctive Queries, J. ACM, 60 (2013), pp. 42:1–42:51
work page 2013
-
[45]
H. Q. Ngo, E. Porat, C. R ´e, and A. Rudra,Worst-case optimal join algorithms: [extended abstract], in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, ACM, 2012, pp. 37–48. [53]H. Q. Ngo, E. Porat, C. R ´e, and A. Rudra,Worst-case optimal join algorithms, J. ACM...
work page 2012
-
[46]
H. Q. Ngo, C. R ´e, and A. Rudra,Skew strikes back: new developments in the theory of join algorithms, SIGMOD Rec., 42 (2013), pp. 5–16. [55]R. Ramakrishnan and J. Gehrke,Database management systems (3. ed.), McGraw-Hill, 2003
work page 2013
-
[47]
T. L. Veldhuizen,Triejoin: A simple, worst-case optimal join algorithm, in Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, OpenProceedings.org, 2014, pp. 96–106
work page 2014
-
[48]
S. Vikneshwar Mani Jayaraman, C. Ropell, and A. Rudra,Worst-case Optimal Binary Join Algorithms under General ℓp Constraints, arXiv e-prints, (2021), p. arXiv:2112.01003
-
[49]
V. V. Williams, Y. Xu, Z. Xu, and R. Zhou,New Bounds for Matrix Multiplication: from Alpha to Omega, pp. 3792–3835
-
[50]
M. Yannakakis,Algorithms for acyclic database schemes, in Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings, IEEE Computer Society, 1981, pp. 82–94
work page 1981
-
[51]
R. Yuster and U. Zwick,Detecting short directed cycles using rectangular matrix multiplication and dynamic programming., in SODA, vol. 4, 2004, pp. 254–260
work page 2004
-
[52]
N. L. Zhang and D. Poole,Exploiting causal independence in bayesian network inference, J. Artif. Int. Res., 5 (1996), p. 301–328
work page 1996
-
[53]
N. X. Zhang and D. Poole,A simple approach to bayesian network computations, in Proceedings of the Canadian AI Conference, Springer, 1994, pp. 207–214
work page 1994
-
[54]
Z. Zhang and R. W. Yeung,A non-shannon-type conditional inequality of information quantities, IEEE Trans. Information Theory, 43 (1997), pp. 1982–1986
work page 1997
-
[55]
Z. Zhang and R. W. Yeung,On characterization of entropy function via information inequalities, IEEE Transactions on Information Theory, 44 (1998), pp. 1440–1452. 26
work page 1998
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.