pith. sign in

arxiv: 2603.12560 · v3 · submitted 2026-03-13 · 💻 cs.DB

Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries

Pith reviewed 2026-05-15 12:22 UTC · model grok-4.3

classification 💻 cs.DB
keywords join-project queriesuniform samplingapproximate countingcommunication complexityrejection samplingoutput optimalitydatabase query processing
0
0 comments X

The pith

First asymptotically optimal algorithms for uniform sampling and approximate counting on matrix, star, and chain join-project queries

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the first asymptotically optimal algorithms for uniform sampling and approximate counting on join-project queries for the classes of matrix, star, and chain shapes. Earlier propose-and-verify methods incurred high costs whenever projections shrank the output size relative to the join. A rejection-based sampling strategy paired with hybrid counting reductions delivers polynomial improvements in running time. Matching lower bounds from communication complexity establish that the results are tight, and these bounds continue to apply even when fast matrix multiplication is allowed. The work is relevant because join-project queries appear often in database workloads and better primitives directly speed up query optimization and approximate query processing.

Core claim

We present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication.

What carries the argument

Rejection-based sampling strategy combined with a hybrid counting reduction that produces output-optimal running time

If this is right

  • Matrix and star queries admit efficient sublinear-time algorithms.
  • Sublinear algorithms are impossible for chain queries in general.
  • The lower bounds remain valid even when fast matrix multiplication is permitted.
  • Optimal algorithms for unrestricted join-project queries stay open.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rejection technique may extend to other query shapes that share similar structural reduction properties.
  • Communication-complexity arguments could set limits for sampling in distributed database systems.
  • Improved primitives may raise performance in practical approximate query processing engines.

Load-bearing premise

The optimality claims and polynomial speedups hold only for the restricted classes of matrix, star, and chain queries.

What would settle it

An algorithm that samples or counts chain join-project queries in sublinear time, or a communication-complexity lower bound that the presented upper bound fails to match

Figures

Figures reproduced from arXiv: 2603.12560 by Jinchao Huang, Xiao Hu.

Figure 1
Figure 1. Figure 1: Comparison between prior results and our new results (in red) in the property testing model. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and counting algorithms for full join queries, a significant gap remains for join-project queries, which are ubiquitous in real-world workloads. The state-of-the-art ``propose-and-verify'' framework \cite{chen2020random} for these queries suffers from fundamental inefficiencies, often yielding prohibitive complexity when projections significantly reduce the output size. In this paper, we present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication. Finally, we delineate the theoretical limits of the problem space. While matrix and star queries admit efficient sublinear-time algorithms, we establish a significantly stronger lower bound for chain queries, demonstrating that sublinear algorithms are impossible in general.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims to present the first asymptotically optimal algorithms for uniform sampling and approximate counting over fundamental classes of join-project queries (matrix, star, and chain). It introduces a rejection-based sampling strategy and a hybrid counting reduction that yield polynomial speedups relative to the prior propose-and-verify framework, and proves optimality by exhibiting matching communication-complexity lower bounds that hold even against algebraic methods such as fast matrix multiplication. The work further delineates theoretical limits, showing that sublinear-time algorithms exist for matrix and star queries but are impossible for chain queries in general.

Significance. If the claimed upper and lower bounds hold, the result is significant: it closes a long-standing gap between full-join and join-project primitives that are ubiquitous in database workloads. The matching communication-complexity bounds supply tight characterizations for the restricted classes, the rejection-sampling and hybrid-counting constructions are technically novel, and the stronger impossibility result for chain queries usefully maps the boundary of what is possible. These contributions advance both the theory and the practical design of sampling/counting engines.

minor comments (3)
  1. Abstract: the phrase 'polynomial speedups' is stated without concrete exponents or degree; adding the precise improvement factors (e.g., from O(n^{2}) to O(n^{1.5})) would strengthen the claim.
  2. §1 (Introduction): the communication-complexity lower-bound argument is summarized at a high level; a short paragraph recalling the precise reduction (e.g., from set-disjointness or index) would help readers verify the matching claim without consulting the full proof.
  3. References: the citation chen2020random appears in the abstract but is not expanded in the provided reference list; ensure the full bibliographic entry is present.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the paper, accurate summary of our contributions on asymptotically optimal uniform sampling and approximate counting for join-project queries, and recommendation for minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives asymptotically optimal algorithms for uniform sampling and approximate counting on restricted classes of join-project queries (matrix, star, chain) via a rejection-based sampling strategy and hybrid counting reduction. Optimality is established by matching lower bounds drawn from external communication-complexity arguments that apply even to algebraic methods such as fast matrix multiplication; these bounds are independent of the paper's own constructions and do not reduce to fitted parameters or self-referential quantities. The manuscript explicitly scopes its positive results to the listed query classes, states that general join-project queries remain open, and proves a stronger impossibility result for sublinear algorithms on chain queries. No derivation step collapses by construction to the inputs, and no load-bearing claim relies on self-citation chains or ansatzes smuggled from prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the work appears to rest on standard complexity-theoretic assumptions and communication-complexity reductions.

pith-pipeline@v0.9.0 · 5490 in / 1040 out tokens · 25458 ms · 2026-05-15T12:22:32.917803+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

  1. [1]

    Acharya, P

    S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. InSIGMOD, pages 275–286, 1999

  2. [2]

    Agarwal, B

    S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. InProceedings of the 8th ACM European Conference on Computer Systems, pages 29–42. ACM, 2013

  3. [3]

    N. Alon, R. Yuster, and U. Zwick. Color-coding.Journal of the ACM (JACM), 42(4):844–856, 1995

  4. [4]

    R. R. Amossen and R. Pagh. Faster join-projects and sparse matrix multiplications. InICDT, pages 121–126. ACM, 2009

  5. [5]

    Arenas, L

    M. Arenas, L. A. Croquevielle, R. Jayaram, and C. Riveros. When is approximate counting for conjunctive queries tractable? InProceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1015–1027, 2021. Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries 31

  6. [6]

    Assadi, M

    S. Assadi, M. Kapralov, and S. Khanna. A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling. InProc. Innovations in Theoretical Computer Science, pages 6:1–6:20, 2019

  7. [7]

    Atserias, M

    A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins.SIAM Journal on Computing, 42 (4):1737–1767, 2013

  8. [8]

    Bagan, A

    G. Bagan, A. Durand, and E. Grandjean. On acyclic conjunctive queries and constant delay enumeration. InInternational Workshop on Computer Science Logic, pages 208–222. Springer, 2007

  9. [9]

    Bar-Yossef, T

    Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. Information theory methods in communication complexity. InProceedings 17th IEEE Annual Conference on Computational Complexity, pages 93–102. IEEE, 2002

  10. [10]

    Censor-Hillel, T

    K. Censor-Hillel, T. Even, and V. Vassilevska Williams. Fast approximate counting of cycles. InICALP 2024, 2024

  11. [11]

    Censor-Hillel, T

    K. Censor-Hillel, T. Even, and V. Vassilevska Williams. Output-sensitive approximate counting via a measure-bounded hyperedge oracle, or: How asymmetry helps estimate k-clique counts faster. InProceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1985–1996, 2025

  12. [12]

    Chaudhuri, R

    S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins.ACM SIGMOD Record, 28(2):263–274, 1999

  13. [13]

    Chen and K

    Y. Chen and K. Yi. Random sampling and size estimation over cyclic joins. InICDT, 2020

  14. [14]

    Cohen and H

    E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In I. Gupta and R. Wattenhofer, editors,Proc. ACM Symposium on Principles of Distributed Computing, pages 225–234. ACM, 2007

  15. [15]

    Cohen and H

    E. Cohen and H. Kaplan. Tighter estimation using bottom k sketches.Proceedings of the VLDB Endowment, 1(1): 213–224, 2008

  16. [16]

    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.Introduction to algorithms. MIT press, 2022

  17. [17]

    Curticapean, H

    R. Curticapean, H. Dell, and D. Marx. Homomorphisms are a good basis for counting small subgraphs. InProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 210–223, 2017

  18. [18]

    Dalirrooyfard, S

    M. Dalirrooyfard, S. Mathialagan, V. Vassilevska Williams, and Y. Xu. Towards optimal output-sensitive clique listing or: Listing cliques from smaller cliques. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, pages 923–934, 2024

  19. [19]

    S. Deng, S. Lu, and Y. Tao. On join sampling and hardness of combinatorial output-sensitive join algorithms. InPODS, 2023

  20. [20]

    T. Eden, A. Levi, D. Ron, and C. Seshadhri. Approximately counting triangles in sublinear time.SIAM Journal on Computing, 46(5):1603–1646, 2017

  21. [21]

    Eisenbrand and F

    F. Eisenbrand and F. Grandoni. On the complexity of fixed parameter clique and dominating set.Theoretical Computer Science, 326(1-3):57–67, 2004

  22. [22]

    R. A. Fisher and F. Yates.Statistical tables for biological, agricultural and medical research. Hafner Publishing Company, 1953

  23. [23]

    X. Hu. Cover or pack: New upper and lower bounds for massively parallel joins. InProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 181–198, 2021

  24. [24]

    Hu and K

    X. Hu and K. Yi. Parallel algorithms for sparse matrix multiplication and join-aggregate queries. InProc. ACM Symposium on Principles of Database Systems, pages 411–425, 2020

  25. [25]

    Huang, Y

    J. Huang, Y. Tao, and S. Wang. Acyclic join sampling under selections: Dichotomy, union sampling, and enumeration. InProc. International Conference on Database Theory, 2026

  26. [26]

    N. L. Johnson, A. W. Kemp, and S. Kotz.Univariate discrete distributions. John Wiley & Sons, 2005

  27. [27]

    Kalyanasundaram and G

    B. Kalyanasundaram and G. Schintger. The probabilistic communication complexity of set intersection.SIAM Journal on Discrete Mathematics, 5(4):545–557, 1992

  28. [28]

    K. Kim, J. Ha, G. Fletcher, and W.-S. Han. Guaranteeing the ˜𝑂(AGM/OUT) runtime for uniform sampling and size estimation over joins. InPODS, page 113–125, 2023

  29. [29]

    Kloks, D

    T. Kloks, D. Kratsch, and H. Müller. Finding and counting small induced subgraphs efficiently.Inf. Process. Lett., 74 (3-4):115–121, 2000

  30. [30]

    D. E. Knuth.The art of computer programming, Volume II: Seminumerical Algorithms, 3rd Edition. Addison-Wesley Professional, 1998

  31. [31]

    Kowaluk, A

    M. Kowaluk, A. Lingas, and E. Lundell. Counting and detecting small subgraphs via equations.SIAM J. Discret. Math., 27(2):892–909, 2013

  32. [32]

    H. Q. Ngo. Worst-case optimal join algorithms: Techniques, results, and open problems. InProc. ACM Symposium on Principles of Database Systems, pages 111–124. ACM, 2018

  33. [33]

    H. Q. Ngo, E. Porat, C. Ré, and A. Rudra. Worst-case optimal join algorithms. InProc. ACM Symposium on Principles of Database Systems, pages 37–48, 2012

  34. [34]

    H. Q. Ngo, C. Ré, and A. Rudra. Skew strikes back: new developments in the theory of join algorithms.Acm Sigmod Record, 42(4):5–16, 2014. 32 X. Hu and J. Huang Algorithm 13:SampleJoinProjectpRq Input: An instance R for the join-project query Q“ pV,E, yq, with some auxiliary indices built forRif needed. Output:A uniform sample of the query resultQpRq. 1whi...

  35. [35]

    Olken and D

    F. Olken and D. Rotem. Simple random sampling from relational databases. InProceedings of the VLDB Endowment, pages 160–169, 1986

  36. [36]

    Olken and D

    F. Olken and D. Rotem. Random sampling from databases: a survey.Statistics and Computing, 5(1):25–42, 1995

  37. [37]

    A. A. Razborov. On the distributional complexity of disjointness. InInternational Colloquium on Automata, Languages, and Programming, pages 249–253. Springer, 1990

  38. [38]

    J. Tětek. Approximate Triangle Counting via Sampling and Fast Matrix Multiplication. In49th International Colloquium on Automata, Languages, and Programming (ICALP 2022), pages 107:1–107:20, 2022

  39. [39]

    Yannakakis

    M. Yannakakis. Algorithms for acyclic database schemes. InProc. International Conference on Very Large Data Bases, pages 82–94, 1981

  40. [40]

    AGMpQ ¯y,t|𝑅 𝑒 | : 𝑒PEuq . Note that 𝜋 V´y𝑠 is a valid join result of Q¯ypR𝑠 q (a “success

    Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi. Random sampling over joins revisited. InSIGMOD, pages 1525–1539, 2018. A Extension to General Join-Project Queries Our framework introduced for matrix and star queries can be easily extended to arbitrary join- project queries. We first recall the AGM bound [7]. For a join query 𝑞“ pV,Eq , letN: EÑZ ` be th...