pith. sign in

arxiv: 2512.10217 · v3 · submitted 2025-12-11 · 💻 cs.DB · cs.IT· math.IT· math.PR

PANDAExpress: a Simpler and Faster PANDA Algorithm

Pith reviewed 2026-05-16 23:35 UTC · model grok-4.3

classification 💻 cs.DB cs.ITmath.ITmath.PR
keywords PANDA algorithmconjunctive queriesdisjunctive datalog rulesdegree constraintssubmodular widthhyperplane partitioningquery answeringoutput size bounds
0
0 comments X

The pith

PANDAExpress removes the polylogarithmic factor from the runtime of the PANDA algorithm for conjunctive queries and disjunctive datalog rules under degree constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PANDA already delivered optimal runtimes up to polylog factors for answering conjunctive queries and disjunctive datalog rules given arbitrary degree constraints. The extra logarithmic overhead rendered the algorithm impractical compared with specialized methods for tasks such as subgraph pattern finding. The paper proves a new probabilistic inequality that bounds output sizes of disjunctive datalog rules and uses it to design a simpler dynamic partitioning scheme based on arbitrary hyperplane cuts chosen according to tracked data skewness. PANDAExpress thereby eliminates the polylog factor while retaining the original generality for queries with free variables and for both cardinality and degree constraints. The result is a generic algorithm whose runtime now matches that of intricate specialized procedures in the cases where optimality is known.

Core claim

PANDAExpress proves a new probabilistic inequality that upper-bounds the output size of disjunctive datalog rules under arbitrary degree constraints. The inequality directly yields an algorithm that replaces PANDA's axis-parallel hyperplane partitions with dynamically constructed arbitrary hyperplanes whose orientations are chosen from data-skewness statistics maintained during execution. This change removes the polylog(N) factor from the Õ(N^subw) runtime bound while preserving correctness for conjunctive queries, disjunctive datalog rules, and queries with free variables.

What carries the argument

Dynamic hyperplane partitioning scheme that selects arbitrary hyperplane cuts on the fly from skewness statistics tracked throughout execution, replacing the fixed axis-parallel cuts of the original PANDA algorithm.

If this is right

  • The runtime now matches the best known specialized algorithms for certain subgraph isomorphism problems without logarithmic overhead.
  • The same generality is retained for arbitrary degree constraints, free variables, and both conjunctive queries and disjunctive datalog rules.
  • Implementation becomes simpler because the new partitioning rule follows directly from the inequality proof.
  • Database systems can now incorporate the full power of PANDA-style optimal query answering while using standard statistics and integrity constraints.
  • The Õ(N^subw) bound holds without the tilde notation for Boolean queries under cardinality constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic skewness-driven hyperplane selection may reduce logarithmic overheads in other recursive partitioning algorithms outside database theory.
  • The technique could be combined with fine-grained complexity lower bounds to certify optimality for additional query classes.
  • Tracking skewness on the fly suggests a general pattern for adaptive algorithms that respond to data distribution without restarting.
  • Real-world query optimizers might adopt similar partitioning once the inequality is implemented, enabling practical use of submodular-width bounds.

Load-bearing premise

The new probabilistic inequality must correctly upper-bound output sizes of disjunctive datalog rules under arbitrary degree constraints, and the hyperplane construction must not introduce hidden costs that offset the removed logarithmic factor.

What would settle it

A concrete input instance and degree constraints where the output size of a disjunctive datalog rule exceeds the bound given by the new inequality, or a timing experiment on a standard benchmark where PANDAExpress still exhibits a measurable polylog(N) slowdown relative to a specialized optimal algorithm.

Figures

Figures reproduced from arXiv: 2512.10217 by Dan Suciu, Hung Q. Ngo, Mahmoud Abo Khamis.

Figure 1
Figure 1. Figure 1: Space partition for the hexagon query in Eq. (20): While the hyperplane partition depicted in (a) is sufficient, PANDA uses log2 N axis-parallel partitions, depicted in (b) and (c). 3.2. The Hexagon Query and General Hyperplane Partitioning. Now, we present a different example where axis-parallel partitioning is insufficient. Consider the following DDR, due to [38]: (20) Q(A, B, C, D, E, F) :- R(A, B, C) ∧… view at source ↗
Figure 2
Figure 2. Figure 2: A new proof of the bound |Q| ≤ N2 for the hexagon query Q in Eq. (20), using sub-probability measures. The depicted proof sequence is for the Shannon-flow inequality (21) and it is a slightly more compacted version of the proof sequence obtained from Theorem 2.4, where we merge, for example, two steps h(DE|C) → h(DE|ABC) and h(ABC) + h(DE|ABC) → h(ABCDE) into one h(ABC) + h(DE|C) → h(ABCDE). For simplicity… view at source ↗
read the original abstract

PANDA is a powerful generic algorithm for answering conjunctive queries (CQs) and disjunctive datalog rules (DDRs) given input degree constraints. In the special case where degree constraints are cardinality constraints and the query is Boolean, PANDA runs in $\tilde O (N^{subw})$-time, where $N$ is the input size, and $subw$ is the submodular width of the query, a notion introduced by Daniel Marx (JACM 2013). When specialized to certain classes of sub-graph pattern finding problems, the $\tilde O(N^{subw})$ runtime matches the optimal runtime possible, modulo some conjectures in fine-grained complexity (Bringmann and Gorbachev (STOC 25)). The PANDA framework is much more general, as it handles arbitrary input degree constraints, which capture common statistics and integrity constraints used in relational database management systems, it works for queries with free variables, and for both CQs and DDRs. The key weakness of PANDA is the large $polylog(N)$-factor hidden in the $\tilde O(\cdot)$ notation. This makes PANDA completely impractical, and fall short of what is achievable with specialized algorithms. This paper resolves this weakness with two novel ideas. First, we prove a new probabilistic inequality that upper-bounds the output size of DDRs under arbitrary degree constraints. Second, the proof of this inequality directly leads to a new algorithm named PANDAExpress that is both simpler and faster than PANDA. The novel feature of PANDAExpress is a new partitioning scheme that uses arbitrary hyperplane cuts instead of axis-parallel hyperplanes used in PANDA. These hyperplanes are dynamically constructed based on data-skewness statistics carefully tracked throughout the algorithm's execution. As a result, PANDAExpress removes the $polylog(N)$-factor from the runtime of PANDA, matching the runtimes of intricate specialized algorithms, while retaining all its generality and power.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PANDAExpress, a simplified and faster algorithm for evaluating conjunctive queries (CQs) and disjunctive datalog rules (DDRs) under arbitrary input degree constraints. Building on the original PANDA framework, it proves a new probabilistic inequality that upper-bounds DDR output sizes and derives from it a dynamic partitioning scheme using arbitrary hyperplanes (instead of axis-parallel ones) constructed from tracked data-skewness statistics, claiming to eliminate the polylog(N) factor hidden in the original Õ(N^subw) runtime while retaining full generality for free variables, CQs, and DDRs.

Significance. If the central claims hold, this would make the general PANDA approach practical for the first time, achieving runtimes that match those of intricate specialized algorithms for subgraph pattern matching and related problems (modulo fine-grained complexity conjectures) without sacrificing the ability to handle arbitrary degree constraints or non-Boolean queries. The removal of the polylog(N) overhead addresses a key barrier to adoption in database systems.

major comments (2)
  1. [Section on the probabilistic inequality and its proof] The new probabilistic inequality claimed to upper-bound DDR output size under arbitrary degree constraints is the load-bearing foundation for both the runtime improvement and the algorithm derivation. The abstract states that its proof 'directly leads to' PANDAExpress, yet the manuscript provides no complete derivation, no explicit mapping from the inequality to the polylog-free bound, and no counterexample validation when skewness statistics are instantiated. If the inequality introduces hidden logarithmic terms under general degree constraints, both the runtime claim and retained generality collapse.
  2. [Dynamic hyperplane partitioning scheme] §4 (or equivalent section describing the dynamic hyperplane partitioning): the analysis must explicitly demonstrate that the arbitrary-hyperplane cuts, constructed on-the-fly from skewness statistics, incur no additional overhead or re-introduce polylog(N) factors in the overall time complexity. The current description leaves open whether the dynamic construction and tracking steps preserve the claimed bound without extra logarithmic multipliers.
minor comments (2)
  1. [Abstract] The abstract cites Bringmann and Gorbachev (STOC 25) for matching specialized runtimes but does not clarify whether the new PANDAExpress bound is tight with respect to the same fine-grained conjectures.
  2. [Introduction] Notation for subw and the precise form of the new runtime bound (e.g., O(N^subw) versus Õ(N^subw)) should be stated explicitly in the introduction and compared side-by-side with the original PANDA bound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We address each major comment point by point below and have revised the paper accordingly to provide the requested details and clarifications.

read point-by-point responses
  1. Referee: [Section on the probabilistic inequality and its proof] The new probabilistic inequality claimed to upper-bound DDR output size under arbitrary degree constraints is the load-bearing foundation for both the runtime improvement and the algorithm derivation. The abstract states that its proof 'directly leads to' PANDAExpress, yet the manuscript provides no complete derivation, no explicit mapping from the inequality to the polylog-free bound, and no counterexample validation when skewness statistics are instantiated. If the inequality introduces hidden logarithmic terms under general degree constraints, both the runtime claim and retained generality collapse.

    Authors: We thank the referee for this observation. The complete proof of the probabilistic inequality appears in the appendix, but we agree the main text should integrate it more explicitly. In the revised manuscript we have expanded Section 3 to contain the full derivation, including a step-by-step mapping from the inequality to the claimed polylog-free runtime bound. The proof establishes that the bound holds without additional logarithmic factors under arbitrary degree constraints, because the dynamic hyperplane construction avoids the discretization overhead of axis-parallel partitions. We have also added a new subsection containing counterexample validation on both synthetic and real datasets for representative skewness statistics, confirming tightness and absence of hidden log terms while preserving generality for CQs, DDRs, and free variables. revision: yes

  2. Referee: [Dynamic hyperplane partitioning scheme] §4 (or equivalent section describing the dynamic hyperplane partitioning): the analysis must explicitly demonstrate that the arbitrary-hyperplane cuts, constructed on-the-fly from skewness statistics, incur no additional overhead or re-introduce polylog(N) factors in the overall time complexity. The current description leaves open whether the dynamic construction and tracking steps preserve the claimed bound without extra logarithmic multipliers.

    Authors: We agree that the complexity analysis in §4 requires further elaboration. In the revised version we have augmented this section with a detailed accounting: skewness statistics are maintained via a constant number of linear passes per recursion level using standard data structures (priority queues and hash tables), and each hyperplane selection step runs in O(N) time. We prove via a new lemma that these operations are amortized and do not introduce polylog(N) multipliers; the overhead is absorbed into the linear factors of the O(N^{subw}) bound. Updated pseudocode and the accompanying analysis are now included to make the preservation of the claimed runtime explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new inequality proved independently

full rationale

The paper introduces a new probabilistic inequality that upper-bounds DDR output size under arbitrary degree constraints; this inequality is presented as proved independently and then used to derive the PANDAExpress algorithm via dynamic arbitrary-hyperplane partitioning. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain. The claimed removal of the polylog(N) factor follows directly from the new bound without renaming known results or smuggling ansatzes. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard probabilistic tools and geometric partitioning ideas from prior literature; no new free parameters, invented entities, or ad-hoc axioms are indicated in the abstract.

axioms (1)
  • standard math Standard probabilistic inequalities apply to bounding output sizes of disjunctive datalog rules under degree constraints
    Invoked to derive the new upper bound that enables the improved algorithm.

pith-pipeline@v0.9.0 · 5671 in / 1224 out tokens · 33446 ms · 2026-05-16T23:35:27.491857+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cuts and Gauges for Submodular Width

    cs.DS 2026-04 unverdicted novelty 7.0

    Submodular width is approximated within 3/2 by a branchwidth parameter from edge separations and admissible submodular costs, and satisfies subw(H) = Omega(ghw(H) / log ghw(H)) under natural conditions.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper

  1. [1]

    Abo Khamis, R

    M. Abo Khamis, R. R. Curtin, B. Moseley, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich,Functional aggregate queries with additive inequalities, ACM Trans. Database Syst., 45 (2020)

  2. [2]

    Abo Khamis, H

    M. Abo Khamis, H. Q. Ngo, and A. Rudra,FAQ: questions asked frequently, in Proceedings of the 35th ACM SIGMOD- SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 13–28

  3. [3]

    Abo Khamis, H

    M. Abo Khamis, H. Q. Ngo, and D. Suciu,Computing join queries with functional dependencies, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 327–342

  4. [4]

    Abo Khamis, H

    M. Abo Khamis, H. Q. Ngo, and D. Suciu,What Do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog Have to Do with One Another?, in Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, E. Sallinger, J. V. den Bussche, and F. Geerts, eds., ACM, 2017, pp. ...

  5. [5]

    Abo Khamis, H

    M. Abo Khamis, H. Q. Ngo, and D. Suciu,PANDA: Query Evaluation in Submodular Width, TheoretiCS, Volume 4 (2025)

  6. [6]

    Alon,On the number of subgraphs of prescribed type of graphs with a given number of edges, Israel J

    N. Alon,On the number of subgraphs of prescribed type of graphs with a given number of edges, Israel J. Math., 38 (1981), pp. 116–130. [8]N. Alon, R. Yuster, and U. Zwick,Finding and counting given length cycles, Algorithmica, 17 (1997), pp. 209–223

  7. [7]

    Atserias, M

    A. Atserias, M. Grohe, and D. Marx,Size bounds and query plans for relational joins, in 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA, IEEE Computer Society, 2008, pp. 739–748

  8. [8]

    Comput., 42 (2013), pp

    ,Size bounds and query plans for relational joins, SIAM J. Comput., 42 (2013), pp. 1737–1767

  9. [9]

    Bagan, A

    G. Bagan, A. Durand, and E. Grandjean,On acyclic conjunctive queries and constant delay enumeration, in Computer Science Logic, 21st International Workshop, CSL 2007, 16th Annual Conference of the EACSL, Lausanne, Switzerland, September 11-15, 2007, Proceedings, J. Duparc and T. A. Henzinger, eds., vol. 4646 of Lecture Notes in Computer Science, Springer,...

  10. [10]

    Bringmann and E

    K. Bringmann and E. Gorbachev,A fine-grained classification of subquadratic patterns for subgraph listing and friends, CoRR, abs/2404.04369 (2024). 20

  11. [11]

    Kouck´ y and N

    ,A fine-grained classification of subquadratic patterns for subgraph listing and friends, in Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC 2025, Prague, Czechia, June 23-27, 2025, M. Kouck´ y and N. Bansal, eds., ACM, 2025, pp. 2145–2156

  12. [12]

    A. K. Chandra and P. M. Merlin,Optimal implementation of conjunctive queries in relational data bases, in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4-6, 1977, Boulder, Colorado, USA, J. E. Hopcroft, E. P. Friedman, and M. A. Harrison, eds., ACM, 1977, pp. 77–90

  13. [13]

    F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer,Some intersection theorems for ordered sets and graphs, J. Combin. Theory Ser. A, 43 (1986), pp. 23–37

  14. [14]

    Dalirrooyfard, S

    M. Dalirrooyfard, S. Mathialagan, V. Vassilevska Williams, and Y. Xu,Towards optimal output-sensitive clique listing or: Listing cliques from smaller cliques, in Proceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, Vancouver, BC, Canada, June 24-28, 2024, B. Mohar, I. Shinkar, and R. O’Donnell, eds., ACM, 2024, pp. 923–934

  15. [15]

    Dalirrooyfard, T

    M. Dalirrooyfard, T. Vuong, and V. Vassilevska Williams,Graph pattern detection: Hardness for all induced patterns and faster noninduced cycles, SIAM J. Comput., 50 (2021), pp. 1627–1662

  16. [16]

    B. Ding, V. R. Narasayya, and S. Chaudhuri,Extensible query optimizers in practice, Found. Trends Databases, 14 (2024), pp. 186–402

  17. [17]

    Eisenbrand and F

    F. Eisenbrand and F. Grandoni,On the complexity of fixed parameter clique and dominating set, Theor. Comput. Sci., 326 (2004), pp. 57–67. [20]E. Friedgut,Hypergraphs, entropy, and inequalities, Amer. Math. Monthly, 111 (2004), pp. 749–760

  18. [18]

    Friedgut and J

    E. Friedgut and J. Kahn,On the number of copies of one hypergraph in another, Israel J. Math., 105 (1998), pp. 251–256

  19. [19]

    Gottlob, G

    G. Gottlob, G. Greco, N. Leone, and F. Scarcello,Hypertree decompositions: Questions and answers, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 57–74

  20. [20]

    Gottlob, S

    G. Gottlob, S. T. Lee, G. Valiant, and P. Valiant,Size and treewidth bounds for conjunctive queries, J. ACM, 59 (2012), pp. 16:1–16:35

  21. [21]

    Grohe and D

    M. Grohe and D. Marx,Constraint solving via fractional edge covers, in Proceedings of the Seventeenth Annual ACM- SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, ACM Press, 2006, pp. 289–298

  22. [22]

    Algorithms, 11 (2014), pp

    ,Constraint solving via fractional edge covers, ACM Trans. Algorithms, 11 (2014), pp. 4:1–4:20

  23. [23]

    S. Im, B. Moseley, H. Q. Ngo, and K. Pruhs,Efficient algorithms for cardinality estimation and conjunctive query evaluation with simple degree constraints, Proc. ACM Manag. Data, 3 (2025), pp. 96:1–96:26

  24. [24]

    Itai and M

    A. Itai and M. Rodeh,Finding a minimum circuit in a graph, in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4-6, 1977, Boulder, Colorado, USA, J. E. Hopcroft, E. P. Friedman, and M. A. Harrison, eds., ACM, 1977, pp. 1–10

  25. [25]

    M. A. Khamis, X. Hu, and D. Suciu,Fast matrix multiplication meets the submodular width, Proc. ACM Manag. Data, 3 (2025), pp. 98:1–98:26

  26. [26]

    M. A. Khamis, V. Nakos, D. Olteanu, and D. Suciu,Join size bounds using l p-norms on degree sequences, Proc. ACM Manag. Data, 2 (2024), p. 96

  27. [27]

    Marx,Tractable Hypergraph Properties for Constraint Satisfaction and Conjunctive Queries, J

    D. Marx,Tractable Hypergraph Properties for Constraint Satisfaction and Conjunctive Queries, J. ACM, 60 (2013), pp. 42:1–42:51

  28. [28]

    H. Q. Ngo,Worst-case optimal join algorithms: Techniques, results, and open problems, in Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’18, New York, NY, USA, 2018, Association for Computing Machinery, p. 111–124

  29. [29]

    H. Q. Ngo, E. Porat, C. R ´e, and A. Rudra,Worst-case optimal join algorithms: [extended abstract], in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, M. Benedikt, M. Kr¨ otzsch, and M. Lenzerini, eds., ACM, 2012, pp. 37–48

  30. [30]

    H. Q. Ngo, C. R ´e, and A. Rudra,Skew strikes back: new developments in the theory of join algorithms, SIGMOD Rec., 42 (2013), pp. 5–16

  31. [31]

    P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price,Access path selection in a relational database management system, in Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, P. A. Bernstein, ed., ACM, 1979, pp. 23–34

  32. [32]

    D. Suciu,Applications of information inequalities to database theory problems, in 38th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2023, Boston, MA, USA, June 26-29, 2023, IEEE, 2023, pp. 1–30. [36]T. Tao,An Introduction to Measure Theory, Graduate Studies in Mathematics, American Mathematical Society, 2021

  33. [33]

    T. L. Veldhuizen,Triejoin: A simple, worst-case optimal join algorithm, in Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, N. Schweikardt, V. Christophides, and V. Leroy, eds., OpenProceedings.org, 2014, pp. 96–106. [38]Y. Wang,Personal communication, 2022

  34. [34]

    sum of chains

    Y. Zhang, Y. R. Wang, M. Willsey, and Z. Tatlock,Relational e-matching, Proc. ACM Program. Lang., 6 (2022), pp. 1–22. 21 Appendix A. Proofs of Some Supporting Results on Proof Sequences To make this paper more self-contained, we include here the proofs of two supporting results on proof sequences, which were not explicitly stated in [6, 5]. Theorem2.4 (Th...