Unbiased Insights: Optimal Streaming Algorithms for ell_p Sampling, the Forget Model, and Beyond
Pith reviewed 2026-05-18 23:46 UTC · model grok-4.3
The pith
New streaming samplers achieve near-optimal space for ell_p sampling and deliver nearly unbiased F_p estimates even when forget operations reset frequencies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that nearly space-optimal approximate ell_p samplers exist for p in (0,2) using O~(log n log(1/delta)) bits of space and for p=2 using O~(log^2 n log(1/delta)) bits, that these samplers extend to a continuous version outputting a valid index after every update, and that they yield nearly unbiased F_p estimators in streams containing forget operations, thereby giving near-optimal algorithms for all p in the forget model while also supporting entropy estimation in the suffix-prefix deletion model and arbitrary contraction functions g in G.
What carries the argument
The approximate ell_p sampler that maintains a distribution over indices proportional to |f_i|^p while surviving forget resets to produce nearly unbiased estimators.
If this is right
- Nearly unbiased F_p estimators exist for every p in the forget model.
- Near-optimal single-pass algorithms become available for all frequency-moment estimation tasks in the forget model.
- A valid sample index can be returned after every single update in the stream.
- Entropy can be estimated to near-optimal space in the suffix-prefix deletion model.
- Any coordinate-wise contraction function g in G can be applied while the stream is processed.
Where Pith is reading between the lines
- The same sampler primitives might extend to streams with both insertions and arbitrary deletions if the forget resets can be simulated.
- The log-factor space saving could translate into measurable reductions in memory footprint for large-scale distributed monitoring systems.
- Continuous sampling might simplify online decision procedures that need fresh ell_p samples at arbitrary times rather than only at the end of the stream.
Load-bearing premise
The forget operations reset frequencies exactly as specified and the sampler construction can absorb the resulting non-linear effects without extra assumptions on the arrival order or value distribution.
What would settle it
A stream containing forget resets on which the produced F_p estimator deviates from the true value by more than the claimed approximation factor with high probability, or a space lower bound proof showing that ell_p sampling for p in (0,2) requires asymptotically more than O(log n log(1/delta)) bits.
Figures
read the original abstract
We study $\ell_p$ sampling and frequency moment estimation in a single-pass insertion-only data stream. For $p \in (0,2)$, we present a nearly space-optimal approximate $\ell_p$ sampler that uses $\widetilde{O}(\log n \log(1/\delta))$ bits of space and for $p = 2$, we present a sampler with space complexity $\widetilde{O}(\log^2 n \log(1/\delta))$. This space complexity is optimal for $p \in (0, 2)$ and improves upon prior work by a $\log n$ factor. We further extend our construction to a continuous $\ell_p$ sampler, which outputs a valid sample index at every point during the stream. Leveraging these samplers, we design nearly unbiased estimators for $F_p$ in data streams that include forget operations, which reset individual element frequencies and introduce significant non-linear challenges. As a result, we obtain near-optimal algorithms for estimating $F_p$ for all $p$ in this model, originally proposed by Pavan, Chakraborty, Vinodchandran, and Meel [PODS'24], resolving all three open problems they posed. Furthermore, we generalize this model to what we call the suffix-prefix deletion model, and extend our techniques to estimate entropy as a corollary of our moment estimation algorithms. Finally, we show how to handle arbitrary coordinate-wise functions during the stream, for any $g \in \mathbb{G}$, where $\mathbb{G}$ includes all (linear or non-linear) contraction functions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents nearly space-optimal approximate ℓ_p samplers for insertion-only streams: for p ∈ (0,2) the space is Õ(log n log(1/δ)) bits and for p=2 it is Õ(log² n log(1/δ)), claimed optimal for p<2 and improving prior work by a log n factor. It extends the sampler to a continuous version that outputs a valid sample at every point, then leverages these to obtain nearly unbiased F_p estimators in the forget model (resolving all three open problems from PODS'24). Further generalizations cover the suffix-prefix deletion model, entropy estimation, and arbitrary coordinate-wise contraction functions g ∈ G.
Significance. If the analyses hold, the work is significant for closing open questions on F_p estimation under forget operations and for the improved space bounds on ℓ_p sampling. Credit is due for adapting standard hashing and stable-distribution techniques to produce unbiased estimators without extra distributional assumptions, and for the direct resolution of the PODS'24 problems via the continuous sampler construction.
minor comments (2)
- [Introduction] §1, paragraph on prior work: the log n improvement is stated but the exact space bound from the PODS'24 paper is not quoted; adding the comparison would strengthen the claim.
- [Final section] The definition of the class G of contraction functions (mentioned in the abstract and final section) should include an explicit list or characterization of included non-linear functions to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the recognition of its significance in improving space bounds for ℓ_p sampling and resolving the open problems from PODS'24, and the recommendation for minor revision. We appreciate the credit given for adapting hashing and stable distribution techniques to obtain unbiased estimators.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's constructions for approximate ℓ_p samplers rely on standard hashing and stable-distribution techniques adapted to insertion-only streams, achieving the stated space bounds independently of the target results. Extensions to the forget model maintain continuous sampler state across resets to bound estimator bias, with the analysis drawing on the external PODS'24 definition of the model rather than redefining it. Optimality for p ∈ (0,2) is established by direct matching to known lower bounds from the literature, and resolution of the cited open problems follows from the unbiasedness properties without reducing to self-referential equations, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation remains self-contained against external benchmarks and standard algorithmic primitives.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The data stream is single-pass and insertion-only, with forget operations defined exactly as in the PODS'24 reference.
Reference graph
Works this paper leans on
-
[1]
Streaming Algorithms from Pre- cision Sampling
[AKO11] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming Algorithms from Pre- cision Sampling. InProceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science FOCS, page 363–372,
work page 2011
-
[2]
Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P
[BCI+17] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. BPTree: An 2 Heavy Hitters Algorithm Using Constant Memory. In Emanuel Sallinger, Jan Van den Bussche, and Floris Geerts, editors,Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago...
work page 2017
-
[3]
[BVWY18] Vladimir Braverman, Emanuele Viola, David P. Woodruff, and Lin F. Yang. Revisiting Fre- quency Moment Estimation in Random Order Streams. In Ioannis Chatzigiannakis, Christos Kaklamanis, D´ aniel Marx, and Donald Sannella, editors,45th International Colloquium on Au- tomata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech R...
work page 2018
-
[4]
Optimality of Frequency Moment Estimation
[BZ25] Mark Braverman and Or Zamir. Optimality of Frequency Moment Estimation. In Michal Kouck´ y and Nikhil Bansal, editors,Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC 2025, Prague, Czechia, June 23-27, 2025, pages 360–370. ACM,
work page 2025
-
[5]
37 [CCD12] Edith Cohen, Graham Cormode, and Nick G. Duffield. Don’t let the negatives bring you down: sampling from streams of signed updates. In Peter G. Harrison, Martin F. Arlitt, and Giu- liano Casale, editors,ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’12, London, United King-...
work page 2012
-
[6]
Chen, and Martin Farach-Colton
[CCF02] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding Frequent Items in Data Streams. In Peter Widmayer, Francisco Triguero Ruiz, Rafael Morales Bueno, Matthew Hen- nessy, Stephan J. Eidenbenz, and Ricardo Conejo, editors,Automata, Languages and Program- ming, 29th International Colloquium, ICALP 2002, Malaga, Spain, July 8-13, 2002, Pr...
work page 2002
-
[7]
Near-Optimal Lower Bounds on the Multi-Party Communication Complexity of Set Disjointness
[CKS03] Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-Optimal Lower Bounds on the Multi-Party Communication Complexity of Set Disjointness. In18th Annual IEEE Conference on Computational Complexity (Complexity 2003), 7-10 July 2003, Aarhus, Denmark, pages 107–117. IEEE Computer Society,
work page 2003
-
[8]
[CPW20] Edith Cohen, Rasmus Pagh, and David P. Woodruff. WOR andp’s: Sketches for p-sampling without replacement. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Dece...
work page 2020
-
[9]
[DNSS92] David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri. Practical Skew Handling in Parallel Joins. In Li-Yan Yuan, editor,18th International Conference on Very Large Data Bases, August 23-27, 1992, Vancouver, Canada, Proceedings, pages 27–40. Morgan Kaufmann,
work page 1992
-
[10]
Differentially Private Continual Releases of Streaming Frequency Moment estimations
[EMM+23] Alessandro Epasto, Jieming Mao, Andres Mu˜ noz Medina, Vahab Mirrokni, Sergei Vassilvitskii, and Peilin Zhong. Differentially Private Continual Releases of Streaming Frequency Moment estimations. In Yael Tauman Kalai, editor,14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts,...
work page 2023
-
[11]
[GLH07] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bernoulli samples over evolving multisets. In Leonid Libkin, editor,Proceedings of the Twenty-Sixth ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, pages 93–102. ACM,
work page 2007
-
[12]
[HNSS95] Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio, editors,VLDB’95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland, pages 311–322. M...
work page 1995
-
[13]
Ioannidis and Viswanath Poosala
[IP95] Yannis E. Ioannidis and Viswanath Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In Michael J. Carey and Donovan A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, USA, May 22-25, 1995, pages 233–244. ACM Press,
work page 1995
-
[14]
[IPW11] Piotr Indyk, Eric Price, and David P. Woodruff. On the Power of Adaptivity in Sparse Recovery. In Rafail Ostrovsky, editor,IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 285–294. IEEE Computer Society,
work page 2011
-
[15]
Kane, Jelani Nelson, and David P
[KNW10b] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An Optimal Algorithm for the Distinct Elements Problem. In Jan Paredaens and Dirk Van Gucht, editors,Proceedings of the Twenty- Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, June 6-11, 2010, Indianapolis, Indiana, USA, pages 41–52. ACM,
work page 2010
-
[16]
[LW13] Yi Li and David P. Woodruff. A Tight Lower Bound for High Frequency Moment Estimation with Small Error. In Prasad Raghavendra, Sofya Raskhodnikova, Klaus Jansen, and Jos´ e D. P. Rolim, editors,Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 16th International Workshop, APPROX 2013, and 17th International W...
work page 2013
-
[17]
39 [MW10] Morteza Monemizadeh and David P. Woodruff. 1-Pass Relative-Error L p-Sampling with Ap- plications. In Moses Charikar, editor,Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1143–1160. SIAM,
work page 2010
-
[18]
Nolan.Univariate Stable Distributions: Models for Heavy Tailed Data
[Nol20] John P. Nolan.Univariate Stable Distributions: Models for Heavy Tailed Data. Springer Series in Operations Research and Financial Engineering. Springer, Cham, [2020]©2020. [NY22] Jelani Nelson and Huacheng Yu. Optimal Bounds for Approximate Counting. In Leonid Libkin and Pablo Barcel´ o, editors,PODS ’22: International Conference on Management of ...
work page 2020
-
[19]
[PSW14] Rasmus Pagh, Morten St¨ ockel, and David P. Woodruff. Is Min-Wise Hashing Optimal for Summarizing Set Intersection? InProceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’14, Snowbird, UT, USA, June 22-27, 2014, pages 109–120,
work page 2014
-
[20]
Universal perfect samplers for incremental streams
[PW25] Seth Pettie and Dingyu Wang. Universal perfect samplers for incremental streams. In Yossi Azar and Debmalya Panigrahi, editors,Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 3409–
work page 2025
-
[21]
Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space
[WPS22] Lun Wang, Iosif Pinelis, and Dawn Song. Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,
work page 2022
-
[22]
[WZ21] David P. Woodruff and Samson Zhou. Separations for Estimating Large Frequency Moments on Data Streams. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors,48th In- ternational Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 ofLIPIcs, pages 112:1–112:21. ...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.