StreamSampling.jl: Efficient Sampling from Data Streams in Julia
Pith reviewed 2026-05-15 06:26 UTC · model grok-4.3
The pith
StreamSampling.jl enables unbiased sampling from data streams of unknown size in one pass while using only constant memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StreamSampling.jl provides general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown, while maintaining a small, constant memory footprint and demonstrating performance improvements over standard approaches.
What carries the argument
Online reservoir sampling algorithms that maintain a fixed-size sample and probabilistically replace elements as new stream items arrive.
If this is right
- Processing very large or infinite data streams becomes feasible without exhausting memory.
- Samples remain statistically unbiased even as the stream continues indefinitely.
- Applications in real-time data analysis avoid the overhead of full data materialization.
- Direct comparisons show reduced runtime and memory compared to offline sampling methods.
Where Pith is reading between the lines
- Such libraries could support streaming machine learning models that update on sampled batches.
- Integration with distributed systems might allow sampling across multiple data sources.
- Extensions to weighted sampling or other variants could broaden its utility in statistical computing.
Load-bearing premise
The implemented algorithms correctly produce unbiased samples and the benchmark results reflect genuine performance gains without selective testing.
What would settle it
Demonstrating that sample statistics deviate from expected distributions for known stream distributions or that memory usage increases beyond a constant bound would falsify the central claim.
Figures
read the original abstract
StreamSampling$.$jl is a Julia library designed to provide general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown. In this paper, we describe the capabilities of the library and its advantages over traditional sampling procedures, such as maintaining a small, constant memory footprint and avoiding the need to fully materialize the stream in memory. Furthermore, we provide empirical benchmarks comparing online sampling methods against standard approaches, demonstrating performance and memory improvements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StreamSampling.jl, a Julia library implementing single-pass reservoir sampling and variants for data streams of unknown length. It claims these methods maintain O(k) memory independent of stream size N, produce unbiased samples, and outperform traditional approaches that require full materialization, supported by empirical benchmarks on performance and memory usage.
Significance. If the implementations are correct and the benchmarks reproducible, the library would provide a practical, lightweight tool for online sampling in Julia, filling a gap for stream processing applications where memory constraints are tight and full data access is impossible.
major comments (3)
- [Implementation description] The manuscript provides no pseudocode, key function listings, or mathematical description of the implemented algorithms (e.g., reservoir sampling with unknown N). Without these, it is impossible to confirm that the Julia code exactly reproduces the standard unbiased procedures rather than introducing off-by-one errors or incorrect weighting.
- [Empirical benchmarks] Benchmarks are mentioned but lack any description of stream distributions, sizes N, hardware, number of trials, or controls for JIT warm-up effects. This prevents verification that reported speedups and memory savings are genuine rather than artifacts of test choice or post-hoc selection.
- [Results and validation] No statistical validation (e.g., chi-squared tests on sample frequencies or empirical distribution checks) is reported to confirm that the outputs match the theoretical uniform distribution required for unbiased sampling.
minor comments (2)
- [Abstract] The abstract and introduction repeat the same high-level claims without distinguishing the library's novel contributions from standard reservoir sampling algorithms.
- [Introduction] Missing references to foundational papers on reservoir sampling (e.g., Vitter 1985 or Li 1994) would help situate the work.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive report. We address each major comment point by point below and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Implementation description] The manuscript provides no pseudocode, key function listings, or mathematical description of the implemented algorithms (e.g., reservoir sampling with unknown N). Without these, it is impossible to confirm that the Julia code exactly reproduces the standard unbiased procedures rather than introducing off-by-one errors or incorrect weighting.
Authors: We agree that the manuscript would benefit from explicit algorithmic details. In the revised version we will add a dedicated section containing pseudocode for the core reservoir sampling procedures (including the unknown-N variant), key function signatures from the library, and the mathematical formulation of selection probabilities that establishes unbiasedness. revision: yes
-
Referee: [Empirical benchmarks] Benchmarks are mentioned but lack any description of stream distributions, sizes N, hardware, number of trials, or controls for JIT warm-up effects. This prevents verification that reported speedups and memory savings are genuine rather than artifacts of test choice or post-hoc selection.
Authors: We accept this criticism. The revised manuscript will expand the experimental section to report the exact stream sizes (N ranging from 10^5 to 10^8), input distributions (uniform and power-law), hardware configuration, number of repetitions (at least 100 independent trials), and the JIT warm-up protocol used before timing measurements. revision: yes
-
Referee: [Results and validation] No statistical validation (e.g., chi-squared tests on sample frequencies or empirical distribution checks) is reported to confirm that the outputs match the theoretical uniform distribution required for unbiased sampling.
Authors: Although the implemented algorithms follow well-known theoretically unbiased procedures, we agree that empirical confirmation would increase confidence. We will add chi-squared goodness-of-fit tests and empirical cumulative distribution checks on the sampled indices in the revised results section. revision: yes
Circularity Check
No circularity: software implementation of established algorithms with no derivation chain
full rationale
The manuscript describes a Julia library for stream sampling, referencing standard single-pass algorithms such as reservoir sampling for unknown stream lengths. No equations, parameter fitting, self-citations, or uniqueness theorems appear in the provided text or abstract. Claims of unbiased samples and constant memory rest on the correctness of the (unshown) implementation of well-known methods rather than any internal derivation that reduces to its own inputs. Benchmarks are empirical comparisons, not predictions derived from fitted parameters. This is a typical non-circular software paper; the derivation chain is absent by design.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The methods covered here fall under the umbrella ofonline sampling, i.e
Introduction Random sampling from data streams is a fundamental operation in data analysis, particularly when dealing with massive datasets that exceed available memory or continuous streams of indeterminate length. The methods covered here fall under the umbrella ofonline sampling, i.e. algorithms that process stream elements sequentially in a single pas...
-
[2]
Related Work Online sampling techniques have been implemented across a range of programming languages and frameworks, typically addressing specific use cases rather than providing a unified suite of algo- rithms. For instance, eBay’stsv-utilsprovide thetsv-sample command-line tool, which supports simple random sampling and weighted reservoir sampling over...
-
[3]
sequential), the sampling scheme (with vs
Implemented Methods StreamSampling.jlcategorizes its implemented methods along three main axes: the underlying logic (reservoir vs. sequential), the sampling scheme (with vs. without replacement), and the inclusion probabilities (weighted vs. unweighted). For reservoir sampling without replacement,AlgRandAlgL
-
[4]
StreamSampling.jl: Efficient Sampling from Data Streams in Julia
are provided, whereasAlgRSWRSKIP[11] covers the 1 arXiv:2603.21996v1 [cs.SE] 23 Mar 2026 with-replacement counterpart. For weighted reservoir sampling, AlgAResandAlgAExpJ[7] handle the without-replacement case, andAlgWRSWRSKIP[9] is used for the with-replacement case. In the sequential sampling domain, methods likeAlgD[16] and AlgHiddenShuffle[12] are emp...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[5]
Sampling API The interface ofStreamSampling.jlrelies on two main sam- plers, each made for its sampling domain:ReservoirSampler andSequentialSampler. The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator. 4.1 Reservoir Samplers Following theOnlineStats.jl[6]...
-
[6]
Benchmarks Several benchmarks were conducted to evaluate the performance of the reservoir and sequential sampling implementations1. All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12. To evaluate the algorithms’ core computational performance, we benchmarked them on a simple ...
-
[7]
Applications The algorithms provided byStreamSampling.jlcan be applied across various data-intensive domains including approximate query 1The benchmarks and applications described are available at https://github.com/JuliaDynamics/StreamSampling.jl/tree/ main/benchmark. processing, online log monitoring, querying massive network ana- lytics feeds, and sens...
-
[8]
Conclusion StreamSampling.jlintroduces a comprehensive suite of algo- rithms covering both reservoir and sequential paradigms, filling a gap in the Julia ecosystem. The ability to sample from any Julia iterator without materializing it in memory extends the reach of random sampling to contexts such as lazy data pipelines and large on-disk datasets. Future...
-
[9]
Stratified reser- voir sampling over heterogeneous data streams
Mohammed Al-Kateb and Byung Suk Lee. Stratified reser- voir sampling over heterogeneous data streams. In Michael Gertz and Bertram Ludäscher, editors,Scientific and Statisti- cal Database Management, pages 621–639. Springer Berlin Heidelberg, 2010
work page 2010
-
[10]
Jon Louis Bentley and James B. Saxe. Generating sorted lists of random numbers.ACM Transactions on Mathematical Software, 6(3):359–364, 1980. doi:10.1145/355900.355907
-
[11]
Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Vi- ral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi:10.1137/141000671. https://doi.org/10.1137/141000671
-
[12]
Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012
Vladimir Braverman, Rafail Ostrovsky, and Carlo Zan- iolo. Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012. doi:https://doi.org/10.1016/j.jcss.2011.04.004. JCSS Knowl- edge Representation and Reasoning
-
[13]
Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, his- tograms, wavelets, sketches.Foundations and Trends in Databases, 4(1-3):1–294, 2011. doi:10.1561/1900000004. https://www.emerald.com/ftdbs/article-pdf/4/1- 3/1/11024120/1900000004en.pdf
-
[14]
Josh Day and Hua Zhou. Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020. doi:10.21105/joss.01816
-
[15]
Weighted random sampling with a reservoir
Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir.Infor- mation Processing Letters, 97(5):181–185, 2006. doi:https://doi.org/10.1016/j.ipl.2005.11.003
-
[16]
gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994
David MacKenzie et al. gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994
work page 2022
-
[17]
Weighted reser- voir sampling with replacement from data streams
Adriano Meligrana and Adriano Fazzone. Weighted reser- voir sampling with replacement from data streams. In Proceedings of the ACM on Web Conference 2026, New York, NY , USA, 2026. Association for Computing Machin- ery. doi:10.1145/3774904.3792966. Accepted for publica- tion. Currently available as arXiv preprint arXiv:2403.20256
-
[18]
River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,
Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Ge- offrey Bolmier, Raphaël Sourty, Robin Vaysse, Adil Zoui- tine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,
-
[19]
doi:10.48550/arXiv.2012.04740
-
[20]
Byung-Hoon Park, George Ostrouchov, and Nagiza F. Sam- atova. Sampling streaming data with replacement.Compu- tational Statistics & Data Analysis, 52(2):750–762, 2007. doi:https://doi.org/10.1016/j.csda.2007.03.010
-
[21]
Sequential ran- dom sampling revisited: Hidden shuffle method
Michael Shekelyan and Graham Cormode. Sequential ran- dom sampling revisited: Hidden shuffle method. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 3628–3636. PMLR, 2021
work page 2021
-
[22]
An asymptotically optimal, online algorithm for weighted random sampling with replacement
Michal Startek. An asymptotically optimal, online algorithm for weighted random sampling with replacement.CoRR, abs/1611.00532, 2016. arXiv:1611.00532
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[23]
The big data ecosystem at linkedin
Roshan Sumbaly, Jay Kreps, and Sam Shah. The big data ecosystem at linkedin. InProceedings of the 2013 ACM SIG- MOD International Conference on Management of Data, SIGMOD ’13, page 1125–1134. Association for Computing Machinery, 2013. doi:10.1145/2463676.2463707
-
[24]
Jeffrey S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software, 11(1):37–57, 1985. doi:10.1145/3147.3165
-
[25]
Jeffrey Scott Vitter. An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987. doi:10.1145/23002.23003
-
[26]
Efficient trans- mission and reconstruction of dependent data streams via edge sampling
Joel Wolfrath and Abhishek Chandra. Efficient trans- mission and reconstruction of dependent data streams via edge sampling. In2022 IEEE International Confer- ence on Cloud Engineering (IC2E), pages 47–57, 2022. doi:10.1109/IC2E55432.2022.00013. 4
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.