Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries
Pith reviewed 2026-05-15 12:22 UTC · model grok-4.3
The pith
First asymptotically optimal algorithms for uniform sampling and approximate counting on matrix, star, and chain join-project queries
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication.
What carries the argument
Rejection-based sampling strategy combined with a hybrid counting reduction that produces output-optimal running time
If this is right
- Matrix and star queries admit efficient sublinear-time algorithms.
- Sublinear algorithms are impossible for chain queries in general.
- The lower bounds remain valid even when fast matrix multiplication is permitted.
- Optimal algorithms for unrestricted join-project queries stay open.
Where Pith is reading between the lines
- The rejection technique may extend to other query shapes that share similar structural reduction properties.
- Communication-complexity arguments could set limits for sampling in distributed database systems.
- Improved primitives may raise performance in practical approximate query processing engines.
Load-bearing premise
The optimality claims and polynomial speedups hold only for the restricted classes of matrix, star, and chain queries.
What would settle it
An algorithm that samples or counts chain join-project queries in sublinear time, or a communication-complexity lower bound that the presented upper bound fails to match
Figures
read the original abstract
Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and counting algorithms for full join queries, a significant gap remains for join-project queries, which are ubiquitous in real-world workloads. The state-of-the-art ``propose-and-verify'' framework \cite{chen2020random} for these queries suffers from fundamental inefficiencies, often yielding prohibitive complexity when projections significantly reduce the output size. In this paper, we present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication. Finally, we delineate the theoretical limits of the problem space. While matrix and star queries admit efficient sublinear-time algorithms, we establish a significantly stronger lower bound for chain queries, demonstrating that sublinear algorithms are impossible in general.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present the first asymptotically optimal algorithms for uniform sampling and approximate counting over fundamental classes of join-project queries (matrix, star, and chain). It introduces a rejection-based sampling strategy and a hybrid counting reduction that yield polynomial speedups relative to the prior propose-and-verify framework, and proves optimality by exhibiting matching communication-complexity lower bounds that hold even against algebraic methods such as fast matrix multiplication. The work further delineates theoretical limits, showing that sublinear-time algorithms exist for matrix and star queries but are impossible for chain queries in general.
Significance. If the claimed upper and lower bounds hold, the result is significant: it closes a long-standing gap between full-join and join-project primitives that are ubiquitous in database workloads. The matching communication-complexity bounds supply tight characterizations for the restricted classes, the rejection-sampling and hybrid-counting constructions are technically novel, and the stronger impossibility result for chain queries usefully maps the boundary of what is possible. These contributions advance both the theory and the practical design of sampling/counting engines.
minor comments (3)
- Abstract: the phrase 'polynomial speedups' is stated without concrete exponents or degree; adding the precise improvement factors (e.g., from O(n^{2}) to O(n^{1.5})) would strengthen the claim.
- §1 (Introduction): the communication-complexity lower-bound argument is summarized at a high level; a short paragraph recalling the precise reduction (e.g., from set-disjointness or index) would help readers verify the matching claim without consulting the full proof.
- References: the citation chen2020random appears in the abstract but is not expanded in the provided reference list; ensure the full bibliographic entry is present.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the paper, accurate summary of our contributions on asymptotically optimal uniform sampling and approximate counting for join-project queries, and recommendation for minor revision. No major comments were raised in the report.
Circularity Check
No significant circularity detected
full rationale
The paper derives asymptotically optimal algorithms for uniform sampling and approximate counting on restricted classes of join-project queries (matrix, star, chain) via a rejection-based sampling strategy and hybrid counting reduction. Optimality is established by matching lower bounds drawn from external communication-complexity arguments that apply even to algebraic methods such as fast matrix multiplication; these bounds are independent of the paper's own constructions and do not reduce to fitted parameters or self-referential quantities. The manuscript explicitly scopes its positive results to the listed query classes, states that general join-project queries remain open, and proves a stronger impossibility result for sublinear algorithms on chain queries. No derivation step collapses by construction to the inputs, and no load-bearing claim relies on self-citation chains or ansatzes smuggled from prior work by the same authors.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. InSIGMOD, pages 275–286, 1999
work page 1999
-
[2]
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. InProceedings of the 8th ACM European Conference on Computer Systems, pages 29–42. ACM, 2013
work page 2013
-
[3]
N. Alon, R. Yuster, and U. Zwick. Color-coding.Journal of the ACM (JACM), 42(4):844–856, 1995
work page 1995
-
[4]
R. R. Amossen and R. Pagh. Faster join-projects and sparse matrix multiplications. InICDT, pages 121–126. ACM, 2009
work page 2009
-
[5]
M. Arenas, L. A. Croquevielle, R. Jayaram, and C. Riveros. When is approximate counting for conjunctive queries tractable? InProceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1015–1027, 2021. Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries 31
work page 2021
- [6]
-
[7]
A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins.SIAM Journal on Computing, 42 (4):1737–1767, 2013
work page 2013
- [8]
-
[9]
Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. Information theory methods in communication complexity. InProceedings 17th IEEE Annual Conference on Computational Complexity, pages 93–102. IEEE, 2002
work page 2002
-
[10]
K. Censor-Hillel, T. Even, and V. Vassilevska Williams. Fast approximate counting of cycles. InICALP 2024, 2024
work page 2024
-
[11]
K. Censor-Hillel, T. Even, and V. Vassilevska Williams. Output-sensitive approximate counting via a measure-bounded hyperedge oracle, or: How asymmetry helps estimate k-clique counts faster. InProceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1985–1996, 2025
work page 1985
-
[12]
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins.ACM SIGMOD Record, 28(2):263–274, 1999
work page 1999
-
[13]
Y. Chen and K. Yi. Random sampling and size estimation over cyclic joins. InICDT, 2020
work page 2020
-
[14]
E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In I. Gupta and R. Wattenhofer, editors,Proc. ACM Symposium on Principles of Distributed Computing, pages 225–234. ACM, 2007
work page 2007
-
[15]
E. Cohen and H. Kaplan. Tighter estimation using bottom k sketches.Proceedings of the VLDB Endowment, 1(1): 213–224, 2008
work page 2008
-
[16]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.Introduction to algorithms. MIT press, 2022
work page 2022
-
[17]
R. Curticapean, H. Dell, and D. Marx. Homomorphisms are a good basis for counting small subgraphs. InProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 210–223, 2017
work page 2017
-
[18]
M. Dalirrooyfard, S. Mathialagan, V. Vassilevska Williams, and Y. Xu. Towards optimal output-sensitive clique listing or: Listing cliques from smaller cliques. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, pages 923–934, 2024
work page 2024
-
[19]
S. Deng, S. Lu, and Y. Tao. On join sampling and hardness of combinatorial output-sensitive join algorithms. InPODS, 2023
work page 2023
-
[20]
T. Eden, A. Levi, D. Ron, and C. Seshadhri. Approximately counting triangles in sublinear time.SIAM Journal on Computing, 46(5):1603–1646, 2017
work page 2017
-
[21]
F. Eisenbrand and F. Grandoni. On the complexity of fixed parameter clique and dominating set.Theoretical Computer Science, 326(1-3):57–67, 2004
work page 2004
-
[22]
R. A. Fisher and F. Yates.Statistical tables for biological, agricultural and medical research. Hafner Publishing Company, 1953
work page 1953
-
[23]
X. Hu. Cover or pack: New upper and lower bounds for massively parallel joins. InProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 181–198, 2021
work page 2021
- [24]
- [25]
-
[26]
N. L. Johnson, A. W. Kemp, and S. Kotz.Univariate discrete distributions. John Wiley & Sons, 2005
work page 2005
-
[27]
B. Kalyanasundaram and G. Schintger. The probabilistic communication complexity of set intersection.SIAM Journal on Discrete Mathematics, 5(4):545–557, 1992
work page 1992
-
[28]
K. Kim, J. Ha, G. Fletcher, and W.-S. Han. Guaranteeing the ˜𝑂(AGM/OUT) runtime for uniform sampling and size estimation over joins. InPODS, page 113–125, 2023
work page 2023
- [29]
-
[30]
D. E. Knuth.The art of computer programming, Volume II: Seminumerical Algorithms, 3rd Edition. Addison-Wesley Professional, 1998
work page 1998
-
[31]
M. Kowaluk, A. Lingas, and E. Lundell. Counting and detecting small subgraphs via equations.SIAM J. Discret. Math., 27(2):892–909, 2013
work page 2013
-
[32]
H. Q. Ngo. Worst-case optimal join algorithms: Techniques, results, and open problems. InProc. ACM Symposium on Principles of Database Systems, pages 111–124. ACM, 2018
work page 2018
-
[33]
H. Q. Ngo, E. Porat, C. Ré, and A. Rudra. Worst-case optimal join algorithms. InProc. ACM Symposium on Principles of Database Systems, pages 37–48, 2012
work page 2012
-
[34]
H. Q. Ngo, C. Ré, and A. Rudra. Skew strikes back: new developments in the theory of join algorithms.Acm Sigmod Record, 42(4):5–16, 2014. 32 X. Hu and J. Huang Algorithm 13:SampleJoinProjectpRq Input: An instance R for the join-project query Q“ pV,E, yq, with some auxiliary indices built forRif needed. Output:A uniform sample of the query resultQpRq. 1whi...
work page 2014
-
[35]
F. Olken and D. Rotem. Simple random sampling from relational databases. InProceedings of the VLDB Endowment, pages 160–169, 1986
work page 1986
-
[36]
F. Olken and D. Rotem. Random sampling from databases: a survey.Statistics and Computing, 5(1):25–42, 1995
work page 1995
-
[37]
A. A. Razborov. On the distributional complexity of disjointness. InInternational Colloquium on Automata, Languages, and Programming, pages 249–253. Springer, 1990
work page 1990
-
[38]
J. Tětek. Approximate Triangle Counting via Sampling and Fast Matrix Multiplication. In49th International Colloquium on Automata, Languages, and Programming (ICALP 2022), pages 107:1–107:20, 2022
work page 2022
-
[39]
M. Yannakakis. Algorithms for acyclic database schemes. InProc. International Conference on Very Large Data Bases, pages 82–94, 1981
work page 1981
-
[40]
AGMpQ ¯y,t|𝑅 𝑒 | : 𝑒PEuq . Note that 𝜋 V´y𝑠 is a valid join result of Q¯ypR𝑠 q (a “success
Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi. Random sampling over joins revisited. InSIGMOD, pages 1525–1539, 2018. A Extension to General Join-Project Queries Our framework introduced for matrix and star queries can be easily extended to arbitrary join- project queries. We first recall the AGM bound [7]. For a join query 𝑞“ pV,Eq , letN: EÑZ ` be th...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.