pith. sign in

arxiv: 2603.21996 · v2 · pith:A6XA777Unew · submitted 2026-03-23 · 💻 cs.SE · stat.CO

StreamSampling.jl: Efficient Sampling from Data Streams in Julia

Pith reviewed 2026-05-15 06:26 UTC · model grok-4.3

classification 💻 cs.SE stat.CO
keywords stream samplingreservoir samplingJulia programmingdata streamsonline algorithmssingle-pass processingconstant memory
0
0 comments X

The pith

StreamSampling.jl enables unbiased sampling from data streams of unknown size in one pass while using only constant memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a Julia library for sampling from continuous data streams without knowing how many items will arrive. Traditional methods require storing the full stream first, but this approach processes items on the fly. It keeps a small fixed amount of memory regardless of stream length. The library includes benchmarks showing faster performance and lower memory use than loading everything into memory first.

Core claim

StreamSampling.jl provides general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown, while maintaining a small, constant memory footprint and demonstrating performance improvements over standard approaches.

What carries the argument

Online reservoir sampling algorithms that maintain a fixed-size sample and probabilistically replace elements as new stream items arrive.

If this is right

  • Processing very large or infinite data streams becomes feasible without exhausting memory.
  • Samples remain statistically unbiased even as the stream continues indefinitely.
  • Applications in real-time data analysis avoid the overhead of full data materialization.
  • Direct comparisons show reduced runtime and memory compared to offline sampling methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such libraries could support streaming machine learning models that update on sampled batches.
  • Integration with distributed systems might allow sampling across multiple data sources.
  • Extensions to weighted sampling or other variants could broaden its utility in statistical computing.

Load-bearing premise

The implemented algorithms correctly produce unbiased samples and the benchmark results reflect genuine performance gains without selective testing.

What would settle it

Demonstrating that sample statistics deviate from expected distributions for known stream distributions or that memory usage increases beyond a constant bound would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.21996 by Adriano Meligrana.

Figure 1
Figure 1. Figure 1: Comparison of median execution time (ms) and memory allocation [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
read the original abstract

StreamSampling$.$jl is a Julia library designed to provide general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown. In this paper, we describe the capabilities of the library and its advantages over traditional sampling procedures, such as maintaining a small, constant memory footprint and avoiding the need to fully materialize the stream in memory. Furthermore, we provide empirical benchmarks comparing online sampling methods against standard approaches, demonstrating performance and memory improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces StreamSampling.jl, a Julia library implementing single-pass reservoir sampling and variants for data streams of unknown length. It claims these methods maintain O(k) memory independent of stream size N, produce unbiased samples, and outperform traditional approaches that require full materialization, supported by empirical benchmarks on performance and memory usage.

Significance. If the implementations are correct and the benchmarks reproducible, the library would provide a practical, lightweight tool for online sampling in Julia, filling a gap for stream processing applications where memory constraints are tight and full data access is impossible.

major comments (3)
  1. [Implementation description] The manuscript provides no pseudocode, key function listings, or mathematical description of the implemented algorithms (e.g., reservoir sampling with unknown N). Without these, it is impossible to confirm that the Julia code exactly reproduces the standard unbiased procedures rather than introducing off-by-one errors or incorrect weighting.
  2. [Empirical benchmarks] Benchmarks are mentioned but lack any description of stream distributions, sizes N, hardware, number of trials, or controls for JIT warm-up effects. This prevents verification that reported speedups and memory savings are genuine rather than artifacts of test choice or post-hoc selection.
  3. [Results and validation] No statistical validation (e.g., chi-squared tests on sample frequencies or empirical distribution checks) is reported to confirm that the outputs match the theoretical uniform distribution required for unbiased sampling.
minor comments (2)
  1. [Abstract] The abstract and introduction repeat the same high-level claims without distinguishing the library's novel contributions from standard reservoir sampling algorithms.
  2. [Introduction] Missing references to foundational papers on reservoir sampling (e.g., Vitter 1985 or Li 1994) would help situate the work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive report. We address each major comment point by point below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Implementation description] The manuscript provides no pseudocode, key function listings, or mathematical description of the implemented algorithms (e.g., reservoir sampling with unknown N). Without these, it is impossible to confirm that the Julia code exactly reproduces the standard unbiased procedures rather than introducing off-by-one errors or incorrect weighting.

    Authors: We agree that the manuscript would benefit from explicit algorithmic details. In the revised version we will add a dedicated section containing pseudocode for the core reservoir sampling procedures (including the unknown-N variant), key function signatures from the library, and the mathematical formulation of selection probabilities that establishes unbiasedness. revision: yes

  2. Referee: [Empirical benchmarks] Benchmarks are mentioned but lack any description of stream distributions, sizes N, hardware, number of trials, or controls for JIT warm-up effects. This prevents verification that reported speedups and memory savings are genuine rather than artifacts of test choice or post-hoc selection.

    Authors: We accept this criticism. The revised manuscript will expand the experimental section to report the exact stream sizes (N ranging from 10^5 to 10^8), input distributions (uniform and power-law), hardware configuration, number of repetitions (at least 100 independent trials), and the JIT warm-up protocol used before timing measurements. revision: yes

  3. Referee: [Results and validation] No statistical validation (e.g., chi-squared tests on sample frequencies or empirical distribution checks) is reported to confirm that the outputs match the theoretical uniform distribution required for unbiased sampling.

    Authors: Although the implemented algorithms follow well-known theoretically unbiased procedures, we agree that empirical confirmation would increase confidence. We will add chi-squared goodness-of-fit tests and empirical cumulative distribution checks on the sampled indices in the revised results section. revision: yes

Circularity Check

0 steps flagged

No circularity: software implementation of established algorithms with no derivation chain

full rationale

The manuscript describes a Julia library for stream sampling, referencing standard single-pass algorithms such as reservoir sampling for unknown stream lengths. No equations, parameter fitting, self-citations, or uniqueness theorems appear in the provided text or abstract. Claims of unbiased samples and constant memory rest on the correctness of the (unshown) implementation of well-known methods rather than any internal derivation that reduces to its own inputs. Benchmarks are empirical comparisons, not predictions derived from fitted parameters. This is a typical non-circular software paper; the derivation chain is absent by design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work describes a software library that implements standard reservoir sampling and related one-pass algorithms. No new free parameters, axioms, or invented entities are introduced beyond the choice of Julia as the implementation language.

pith-pipeline@v0.9.0 · 5359 in / 1022 out tokens · 37496 ms · 2026-05-15T06:26:24.730470+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

  1. [1]

    The methods covered here fall under the umbrella ofonline sampling, i.e

    Introduction Random sampling from data streams is a fundamental operation in data analysis, particularly when dealing with massive datasets that exceed available memory or continuous streams of indeterminate length. The methods covered here fall under the umbrella ofonline sampling, i.e. algorithms that process stream elements sequentially in a single pas...

  2. [2]

    Related Work Online sampling techniques have been implemented across a range of programming languages and frameworks, typically addressing specific use cases rather than providing a unified suite of algo- rithms. For instance, eBay’stsv-utilsprovide thetsv-sample command-line tool, which supports simple random sampling and weighted reservoir sampling over...

  3. [3]

    sequential), the sampling scheme (with vs

    Implemented Methods StreamSampling.jlcategorizes its implemented methods along three main axes: the underlying logic (reservoir vs. sequential), the sampling scheme (with vs. without replacement), and the inclusion probabilities (weighted vs. unweighted). For reservoir sampling without replacement,AlgRandAlgL

  4. [4]

    StreamSampling.jl: Efficient Sampling from Data Streams in Julia

    are provided, whereasAlgRSWRSKIP[11] covers the 1 arXiv:2603.21996v1 [cs.SE] 23 Mar 2026 with-replacement counterpart. For weighted reservoir sampling, AlgAResandAlgAExpJ[7] handle the without-replacement case, andAlgWRSWRSKIP[9] is used for the with-replacement case. In the sequential sampling domain, methods likeAlgD[16] and AlgHiddenShuffle[12] are emp...

  5. [5]

    The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator

    Sampling API The interface ofStreamSampling.jlrelies on two main sam- plers, each made for its sampling domain:ReservoirSampler andSequentialSampler. The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator. 4.1 Reservoir Samplers Following theOnlineStats.jl[6]...

  6. [6]

    All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12

    Benchmarks Several benchmarks were conducted to evaluate the performance of the reservoir and sequential sampling implementations1. All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12. To evaluate the algorithms’ core computational performance, we benchmarked them on a simple ...

  7. [7]

    Applications The algorithms provided byStreamSampling.jlcan be applied across various data-intensive domains including approximate query 1The benchmarks and applications described are available at https://github.com/JuliaDynamics/StreamSampling.jl/tree/ main/benchmark. processing, online log monitoring, querying massive network ana- lytics feeds, and sens...

  8. [8]

    Conclusion StreamSampling.jlintroduces a comprehensive suite of algo- rithms covering both reservoir and sequential paradigms, filling a gap in the Julia ecosystem. The ability to sample from any Julia iterator without materializing it in memory extends the reach of random sampling to contexts such as lazy data pipelines and large on-disk datasets. Future...

  9. [9]

    Stratified reser- voir sampling over heterogeneous data streams

    Mohammed Al-Kateb and Byung Suk Lee. Stratified reser- voir sampling over heterogeneous data streams. In Michael Gertz and Bertram Ludäscher, editors,Scientific and Statisti- cal Database Management, pages 621–639. Springer Berlin Heidelberg, 2010

  10. [10]

    Jon Louis Bentley and James B. Saxe. Generating sorted lists of random numbers.ACM Transactions on Mathematical Software, 6(3):359–364, 1980. doi:10.1145/355900.355907

  11. [11]

    Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Vi- ral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi:10.1137/141000671. https://doi.org/10.1137/141000671

  12. [12]

    Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012

    Vladimir Braverman, Rafail Ostrovsky, and Carlo Zan- iolo. Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012. doi:https://doi.org/10.1016/j.jcss.2011.04.004. JCSS Knowl- edge Representation and Reasoning

  13. [13]

    Haas, and Chris Jermaine

    Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, his- tograms, wavelets, sketches.Foundations and Trends in Databases, 4(1-3):1–294, 2011. doi:10.1561/1900000004. https://www.emerald.com/ftdbs/article-pdf/4/1- 3/1/11024120/1900000004en.pdf

  14. [14]

    Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020

    Josh Day and Hua Zhou. Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020. doi:10.21105/joss.01816

  15. [15]

    Weighted random sampling with a reservoir

    Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir.Infor- mation Processing Letters, 97(5):181–185, 2006. doi:https://doi.org/10.1016/j.ipl.2005.11.003

  16. [16]

    gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994

    David MacKenzie et al. gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994

  17. [17]

    Weighted reser- voir sampling with replacement from data streams

    Adriano Meligrana and Adriano Fazzone. Weighted reser- voir sampling with replacement from data streams. In Proceedings of the ACM on Web Conference 2026, New York, NY , USA, 2026. Association for Computing Machin- ery. doi:10.1145/3774904.3792966. Accepted for publica- tion. Currently available as arXiv preprint arXiv:2403.20256

  18. [18]

    River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,

    Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Ge- offrey Bolmier, Raphaël Sourty, Robin Vaysse, Adil Zoui- tine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,

  19. [19]

    doi:10.48550/arXiv.2012.04740

  20. [20]

    Sam- atova

    Byung-Hoon Park, George Ostrouchov, and Nagiza F. Sam- atova. Sampling streaming data with replacement.Compu- tational Statistics & Data Analysis, 52(2):750–762, 2007. doi:https://doi.org/10.1016/j.csda.2007.03.010

  21. [21]

    Sequential ran- dom sampling revisited: Hidden shuffle method

    Michael Shekelyan and Graham Cormode. Sequential ran- dom sampling revisited: Hidden shuffle method. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 3628–3636. PMLR, 2021

  22. [22]

    An asymptotically optimal, online algorithm for weighted random sampling with replacement

    Michal Startek. An asymptotically optimal, online algorithm for weighted random sampling with replacement.CoRR, abs/1611.00532, 2016. arXiv:1611.00532

  23. [23]

    The big data ecosystem at linkedin

    Roshan Sumbaly, Jay Kreps, and Sam Shah. The big data ecosystem at linkedin. InProceedings of the 2013 ACM SIG- MOD International Conference on Management of Data, SIGMOD ’13, page 1125–1134. Association for Computing Machinery, 2013. doi:10.1145/2463676.2463707

  24. [24]

    Jeffrey S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software, 11(1):37–57, 1985. doi:10.1145/3147.3165

  25. [25]

    An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987

    Jeffrey Scott Vitter. An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987. doi:10.1145/23002.23003

  26. [26]

    Efficient trans- mission and reconstruction of dependent data streams via edge sampling

    Joel Wolfrath and Abhishek Chandra. Efficient trans- mission and reconstruction of dependent data streams via edge sampling. In2022 IEEE International Confer- ence on Cloud Engineering (IC2E), pages 47–57, 2022. doi:10.1109/IC2E55432.2022.00013. 4