StreamSampling.jl: Efficient Sampling from Data Streams in Julia

Adriano Meligrana

arxiv: 2603.21996 · v2 · pith:A6XA777Unew · submitted 2026-03-23 · 💻 cs.SE · stat.CO

StreamSampling.jl: Efficient Sampling from Data Streams in Julia

Adriano Meligrana This is my paper

Pith reviewed 2026-05-15 06:26 UTC · model grok-4.3

classification 💻 cs.SE stat.CO

keywords stream samplingreservoir samplingJulia programmingdata streamsonline algorithmssingle-pass processingconstant memory

0 comments

The pith

StreamSampling.jl enables unbiased sampling from data streams of unknown size in one pass while using only constant memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a Julia library for sampling from continuous data streams without knowing how many items will arrive. Traditional methods require storing the full stream first, but this approach processes items on the fly. It keeps a small fixed amount of memory regardless of stream length. The library includes benchmarks showing faster performance and lower memory use than loading everything into memory first.

Core claim

StreamSampling.jl provides general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown, while maintaining a small, constant memory footprint and demonstrating performance improvements over standard approaches.

What carries the argument

Online reservoir sampling algorithms that maintain a fixed-size sample and probabilistically replace elements as new stream items arrive.

If this is right

Processing very large or infinite data streams becomes feasible without exhausting memory.
Samples remain statistically unbiased even as the stream continues indefinitely.
Applications in real-time data analysis avoid the overhead of full data materialization.
Direct comparisons show reduced runtime and memory compared to offline sampling methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such libraries could support streaming machine learning models that update on sampled batches.
Integration with distributed systems might allow sampling across multiple data sources.
Extensions to weighted sampling or other variants could broaden its utility in statistical computing.

Load-bearing premise

The implemented algorithms correctly produce unbiased samples and the benchmark results reflect genuine performance gains without selective testing.

What would settle it

Demonstrating that sample statistics deviate from expected distributions for known stream distributions or that memory usage increases beyond a constant bound would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.21996 by Adriano Meligrana.

read the original abstract

StreamSampling$.$jl is a Julia library designed to provide general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown. In this paper, we describe the capabilities of the library and its advantages over traditional sampling procedures, such as maintaining a small, constant memory footprint and avoiding the need to fully materialize the stream in memory. Furthermore, we provide empirical benchmarks comparing online sampling methods against standard approaches, demonstrating performance and memory improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a practical Julia package for standard reservoir sampling with some benchmarks, but the manuscript gives almost no implementation details or validation.

read the letter

The paper describes StreamSampling.jl, a library that lets you sample from data streams in one pass without knowing the total length ahead of time. It keeps memory use small and constant, which is the main practical point. The work packages well-known algorithms like reservoir sampling for Julia users and includes some empirical comparisons to show speed and memory gains over loading the full stream first. That packaging and the benchmarks are what the authors actually contribute here. For someone already working in Julia on streaming data tasks, the library could be a convenient drop-in tool that avoids reinventing the sampling logic. The abstract lays out the advantages plainly enough. The soft spots are straightforward. The text stays high-level and supplies neither code listings, pseudocode, nor any statistical checks that the samples match the expected distributions. The benchmarks are mentioned without details on stream sizes, distributions, hardware, or how JIT effects were handled, so the performance claims cannot be verified from the manuscript. This is not a theoretical paper; the underlying methods predate it. It is aimed at Julia practitioners who need online sampling for large or real-time data. A reader looking for ready-to-use software in that ecosystem would find the description and benchmarks useful. The work is coherent on its own terms and shows clear engagement with the practical problem, so it deserves peer review. I would send it to referees with the expectation that they will request more implementation evidence and benchmark methodology.

Referee Report

3 major / 2 minor

Summary. The paper introduces StreamSampling.jl, a Julia library implementing single-pass reservoir sampling and variants for data streams of unknown length. It claims these methods maintain O(k) memory independent of stream size N, produce unbiased samples, and outperform traditional approaches that require full materialization, supported by empirical benchmarks on performance and memory usage.

Significance. If the implementations are correct and the benchmarks reproducible, the library would provide a practical, lightweight tool for online sampling in Julia, filling a gap for stream processing applications where memory constraints are tight and full data access is impossible.

major comments (3)

[Implementation description] The manuscript provides no pseudocode, key function listings, or mathematical description of the implemented algorithms (e.g., reservoir sampling with unknown N). Without these, it is impossible to confirm that the Julia code exactly reproduces the standard unbiased procedures rather than introducing off-by-one errors or incorrect weighting.
[Empirical benchmarks] Benchmarks are mentioned but lack any description of stream distributions, sizes N, hardware, number of trials, or controls for JIT warm-up effects. This prevents verification that reported speedups and memory savings are genuine rather than artifacts of test choice or post-hoc selection.
[Results and validation] No statistical validation (e.g., chi-squared tests on sample frequencies or empirical distribution checks) is reported to confirm that the outputs match the theoretical uniform distribution required for unbiased sampling.

minor comments (2)

[Abstract] The abstract and introduction repeat the same high-level claims without distinguishing the library's novel contributions from standard reservoir sampling algorithms.
[Introduction] Missing references to foundational papers on reservoir sampling (e.g., Vitter 1985 or Li 1994) would help situate the work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive report. We address each major comment point by point below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Implementation description] The manuscript provides no pseudocode, key function listings, or mathematical description of the implemented algorithms (e.g., reservoir sampling with unknown N). Without these, it is impossible to confirm that the Julia code exactly reproduces the standard unbiased procedures rather than introducing off-by-one errors or incorrect weighting.

Authors: We agree that the manuscript would benefit from explicit algorithmic details. In the revised version we will add a dedicated section containing pseudocode for the core reservoir sampling procedures (including the unknown-N variant), key function signatures from the library, and the mathematical formulation of selection probabilities that establishes unbiasedness. revision: yes
Referee: [Empirical benchmarks] Benchmarks are mentioned but lack any description of stream distributions, sizes N, hardware, number of trials, or controls for JIT warm-up effects. This prevents verification that reported speedups and memory savings are genuine rather than artifacts of test choice or post-hoc selection.

Authors: We accept this criticism. The revised manuscript will expand the experimental section to report the exact stream sizes (N ranging from 10^5 to 10^8), input distributions (uniform and power-law), hardware configuration, number of repetitions (at least 100 independent trials), and the JIT warm-up protocol used before timing measurements. revision: yes
Referee: [Results and validation] No statistical validation (e.g., chi-squared tests on sample frequencies or empirical distribution checks) is reported to confirm that the outputs match the theoretical uniform distribution required for unbiased sampling.

Authors: Although the implemented algorithms follow well-known theoretically unbiased procedures, we agree that empirical confirmation would increase confidence. We will add chi-squared goodness-of-fit tests and empirical cumulative distribution checks on the sampled indices in the revised results section. revision: yes

Circularity Check

0 steps flagged

No circularity: software implementation of established algorithms with no derivation chain

full rationale

The manuscript describes a Julia library for stream sampling, referencing standard single-pass algorithms such as reservoir sampling for unknown stream lengths. No equations, parameter fitting, self-citations, or uniqueness theorems appear in the provided text or abstract. Claims of unbiased samples and constant memory rest on the correctness of the (unshown) implementation of well-known methods rather than any internal derivation that reduces to its own inputs. Benchmarks are empirical comparisons, not predictions derived from fitted parameters. This is a typical non-circular software paper; the derivation chain is absent by design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work describes a software library that implements standard reservoir sampling and related one-pass algorithms. No new free parameters, axioms, or invented entities are introduced beyond the choice of Julia as the implementation language.

pith-pipeline@v0.9.0 · 5359 in / 1022 out tokens · 37496 ms · 2026-05-15T06:26:24.730470+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

[1]

The methods covered here fall under the umbrella ofonline sampling, i.e

Introduction Random sampling from data streams is a fundamental operation in data analysis, particularly when dealing with massive datasets that exceed available memory or continuous streams of indeterminate length. The methods covered here fall under the umbrella ofonline sampling, i.e. algorithms that process stream elements sequentially in a single pas...

work page
[2]

Related Work Online sampling techniques have been implemented across a range of programming languages and frameworks, typically addressing specific use cases rather than providing a unified suite of algo- rithms. For instance, eBay’stsv-utilsprovide thetsv-sample command-line tool, which supports simple random sampling and weighted reservoir sampling over...

work page
[3]

sequential), the sampling scheme (with vs

Implemented Methods StreamSampling.jlcategorizes its implemented methods along three main axes: the underlying logic (reservoir vs. sequential), the sampling scheme (with vs. without replacement), and the inclusion probabilities (weighted vs. unweighted). For reservoir sampling without replacement,AlgRandAlgL

work page
[4]

StreamSampling.jl: Efficient Sampling from Data Streams in Julia

are provided, whereasAlgRSWRSKIP[11] covers the 1 arXiv:2603.21996v1 [cs.SE] 23 Mar 2026 with-replacement counterpart. For weighted reservoir sampling, AlgAResandAlgAExpJ[7] handle the without-replacement case, andAlgWRSWRSKIP[9] is used for the with-replacement case. In the sequential sampling domain, methods likeAlgD[16] and AlgHiddenShuffle[12] are emp...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator

Sampling API The interface ofStreamSampling.jlrelies on two main sam- plers, each made for its sampling domain:ReservoirSampler andSequentialSampler. The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator. 4.1 Reservoir Samplers Following theOnlineStats.jl[6]...

work page
[6]

All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12

Benchmarks Several benchmarks were conducted to evaluate the performance of the reservoir and sequential sampling implementations1. All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12. To evaluate the algorithms’ core computational performance, we benchmarked them on a simple ...

work page
[7]

Applications The algorithms provided byStreamSampling.jlcan be applied across various data-intensive domains including approximate query 1The benchmarks and applications described are available at https://github.com/JuliaDynamics/StreamSampling.jl/tree/ main/benchmark. processing, online log monitoring, querying massive network ana- lytics feeds, and sens...

work page
[8]

Conclusion StreamSampling.jlintroduces a comprehensive suite of algo- rithms covering both reservoir and sequential paradigms, filling a gap in the Julia ecosystem. The ability to sample from any Julia iterator without materializing it in memory extends the reach of random sampling to contexts such as lazy data pipelines and large on-disk datasets. Future...

work page
[9]

Stratified reser- voir sampling over heterogeneous data streams

Mohammed Al-Kateb and Byung Suk Lee. Stratified reser- voir sampling over heterogeneous data streams. In Michael Gertz and Bertram Ludäscher, editors,Scientific and Statisti- cal Database Management, pages 621–639. Springer Berlin Heidelberg, 2010

work page 2010
[10]

Jon Louis Bentley and James B. Saxe. Generating sorted lists of random numbers.ACM Transactions on Mathematical Software, 6(3):359–364, 1980. doi:10.1145/355900.355907

work page doi:10.1145/355900.355907 1980
[11]

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Vi- ral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi:10.1137/141000671. https://doi.org/10.1137/141000671

work page doi:10.1137/141000671 2017
[12]

Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012

Vladimir Braverman, Rafail Ostrovsky, and Carlo Zan- iolo. Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012. doi:https://doi.org/10.1016/j.jcss.2011.04.004. JCSS Knowl- edge Representation and Reasoning

work page doi:10.1016/j.jcss.2011.04.004 2012
[13]

Haas, and Chris Jermaine

Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, his- tograms, wavelets, sketches.Foundations and Trends in Databases, 4(1-3):1–294, 2011. doi:10.1561/1900000004. https://www.emerald.com/ftdbs/article-pdf/4/1- 3/1/11024120/1900000004en.pdf

work page doi:10.1561/1900000004 2011
[14]

Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020

Josh Day and Hua Zhou. Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020. doi:10.21105/joss.01816

work page doi:10.21105/joss.01816 2020
[15]

Weighted random sampling with a reservoir

Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir.Infor- mation Processing Letters, 97(5):181–185, 2006. doi:https://doi.org/10.1016/j.ipl.2005.11.003

work page doi:10.1016/j.ipl.2005.11.003 2006
[16]

gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994

David MacKenzie et al. gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994

work page 2022
[17]

Weighted reser- voir sampling with replacement from data streams

Adriano Meligrana and Adriano Fazzone. Weighted reser- voir sampling with replacement from data streams. In Proceedings of the ACM on Web Conference 2026, New York, NY , USA, 2026. Association for Computing Machin- ery. doi:10.1145/3774904.3792966. Accepted for publica- tion. Currently available as arXiv preprint arXiv:2403.20256

work page doi:10.1145/3774904.3792966 2026
[18]

River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,

Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Ge- offrey Bolmier, Raphaël Sourty, Robin Vaysse, Adil Zoui- tine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,

work page
[19]

doi:10.48550/arXiv.2012.04740

work page doi:10.48550/arxiv.2012.04740 2012
[20]

Sam- atova

Byung-Hoon Park, George Ostrouchov, and Nagiza F. Sam- atova. Sampling streaming data with replacement.Compu- tational Statistics & Data Analysis, 52(2):750–762, 2007. doi:https://doi.org/10.1016/j.csda.2007.03.010

work page doi:10.1016/j.csda.2007.03.010 2007
[21]

Sequential ran- dom sampling revisited: Hidden shuffle method

Michael Shekelyan and Graham Cormode. Sequential ran- dom sampling revisited: Hidden shuffle method. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 3628–3636. PMLR, 2021

work page 2021
[22]

An asymptotically optimal, online algorithm for weighted random sampling with replacement

Michal Startek. An asymptotically optimal, online algorithm for weighted random sampling with replacement.CoRR, abs/1611.00532, 2016. arXiv:1611.00532

work page internal anchor Pith review Pith/arXiv arXiv 2016
[23]

The big data ecosystem at linkedin

Roshan Sumbaly, Jay Kreps, and Sam Shah. The big data ecosystem at linkedin. InProceedings of the 2013 ACM SIG- MOD International Conference on Management of Data, SIGMOD ’13, page 1125–1134. Association for Computing Machinery, 2013. doi:10.1145/2463676.2463707

work page doi:10.1145/2463676.2463707 2013
[24]

Jeffrey S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software, 11(1):37–57, 1985. doi:10.1145/3147.3165

work page doi:10.1145/3147.3165 1985
[25]

An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987

Jeffrey Scott Vitter. An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987. doi:10.1145/23002.23003

work page doi:10.1145/23002.23003 1987
[26]

Efficient trans- mission and reconstruction of dependent data streams via edge sampling

Joel Wolfrath and Abhishek Chandra. Efficient trans- mission and reconstruction of dependent data streams via edge sampling. In2022 IEEE International Confer- ence on Cloud Engineering (IC2E), pages 47–57, 2022. doi:10.1109/IC2E55432.2022.00013. 4

work page doi:10.1109/ic2e55432.2022.00013 2022

[1] [1]

The methods covered here fall under the umbrella ofonline sampling, i.e

Introduction Random sampling from data streams is a fundamental operation in data analysis, particularly when dealing with massive datasets that exceed available memory or continuous streams of indeterminate length. The methods covered here fall under the umbrella ofonline sampling, i.e. algorithms that process stream elements sequentially in a single pas...

work page

[2] [2]

Related Work Online sampling techniques have been implemented across a range of programming languages and frameworks, typically addressing specific use cases rather than providing a unified suite of algo- rithms. For instance, eBay’stsv-utilsprovide thetsv-sample command-line tool, which supports simple random sampling and weighted reservoir sampling over...

work page

[3] [3]

sequential), the sampling scheme (with vs

Implemented Methods StreamSampling.jlcategorizes its implemented methods along three main axes: the underlying logic (reservoir vs. sequential), the sampling scheme (with vs. without replacement), and the inclusion probabilities (weighted vs. unweighted). For reservoir sampling without replacement,AlgRandAlgL

work page

[4] [4]

StreamSampling.jl: Efficient Sampling from Data Streams in Julia

are provided, whereasAlgRSWRSKIP[11] covers the 1 arXiv:2603.21996v1 [cs.SE] 23 Mar 2026 with-replacement counterpart. For weighted reservoir sampling, AlgAResandAlgAExpJ[7] handle the without-replacement case, andAlgWRSWRSKIP[9] is used for the with-replacement case. In the sequential sampling domain, methods likeAlgD[16] and AlgHiddenShuffle[12] are emp...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator

Sampling API The interface ofStreamSampling.jlrelies on two main sam- plers, each made for its sampling domain:ReservoirSampler andSequentialSampler. The package also providesitsample, which, similarly toStatsBase.sample, returns anArray, but, by using stream methods, it can be applied to any iterator. 4.1 Reservoir Samplers Following theOnlineStats.jl[6]...

work page

[6] [6]

All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12

Benchmarks Several benchmarks were conducted to evaluate the performance of the reservoir and sequential sampling implementations1. All bench- marks were run on a machine with an AMD Ryzen 5 5600H (6 cores) and 16GB of RAM, running Ubuntu 24.04 LTS and Julia 1.12. To evaluate the algorithms’ core computational performance, we benchmarked them on a simple ...

work page

[7] [7]

Applications The algorithms provided byStreamSampling.jlcan be applied across various data-intensive domains including approximate query 1The benchmarks and applications described are available at https://github.com/JuliaDynamics/StreamSampling.jl/tree/ main/benchmark. processing, online log monitoring, querying massive network ana- lytics feeds, and sens...

work page

[8] [8]

Conclusion StreamSampling.jlintroduces a comprehensive suite of algo- rithms covering both reservoir and sequential paradigms, filling a gap in the Julia ecosystem. The ability to sample from any Julia iterator without materializing it in memory extends the reach of random sampling to contexts such as lazy data pipelines and large on-disk datasets. Future...

work page

[9] [9]

Stratified reser- voir sampling over heterogeneous data streams

Mohammed Al-Kateb and Byung Suk Lee. Stratified reser- voir sampling over heterogeneous data streams. In Michael Gertz and Bertram Ludäscher, editors,Scientific and Statisti- cal Database Management, pages 621–639. Springer Berlin Heidelberg, 2010

work page 2010

[10] [10]

Jon Louis Bentley and James B. Saxe. Generating sorted lists of random numbers.ACM Transactions on Mathematical Software, 6(3):359–364, 1980. doi:10.1145/355900.355907

work page doi:10.1145/355900.355907 1980

[11] [11]

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Vi- ral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi:10.1137/141000671. https://doi.org/10.1137/141000671

work page doi:10.1137/141000671 2017

[12] [12]

Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012

Vladimir Braverman, Rafail Ostrovsky, and Carlo Zan- iolo. Optimal sampling from sliding windows.Journal of Computer and System Sciences, 78(1):260–272, 2012. doi:https://doi.org/10.1016/j.jcss.2011.04.004. JCSS Knowl- edge Representation and Reasoning

work page doi:10.1016/j.jcss.2011.04.004 2012

[13] [13]

Haas, and Chris Jermaine

Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, his- tograms, wavelets, sketches.Foundations and Trends in Databases, 4(1-3):1–294, 2011. doi:10.1561/1900000004. https://www.emerald.com/ftdbs/article-pdf/4/1- 3/1/11024120/1900000004en.pdf

work page doi:10.1561/1900000004 2011

[14] [14]

Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020

Josh Day and Hua Zhou. Onlinestats.jl: A julia package for statistics on data streams.Journal of Open Source Software, 5(46):1816, 2020. doi:10.21105/joss.01816

work page doi:10.21105/joss.01816 2020

[15] [15]

Weighted random sampling with a reservoir

Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir.Infor- mation Processing Letters, 97(5):181–185, 2006. doi:https://doi.org/10.1016/j.ipl.2005.11.003

work page doi:10.1016/j.ipl.2005.11.003 2006

[16] [16]

gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994

David MacKenzie et al. gnu coreutils.Free Software Founda- tion, Inc, 2022, 1994

work page 2022

[17] [17]

Weighted reser- voir sampling with replacement from data streams

Adriano Meligrana and Adriano Fazzone. Weighted reser- voir sampling with replacement from data streams. In Proceedings of the ACM on Web Conference 2026, New York, NY , USA, 2026. Association for Computing Machin- ery. doi:10.1145/3774904.3792966. Accepted for publica- tion. Currently available as arXiv preprint arXiv:2403.20256

work page doi:10.1145/3774904.3792966 2026

[18] [18]

River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,

Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Ge- offrey Bolmier, Raphaël Sourty, Robin Vaysse, Adil Zoui- tine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. River: machine learning for streaming data in python.Journal of Machine Learning Research, 22:1–8,

work page

[19] [19]

doi:10.48550/arXiv.2012.04740

work page doi:10.48550/arxiv.2012.04740 2012

[20] [20]

Sam- atova

Byung-Hoon Park, George Ostrouchov, and Nagiza F. Sam- atova. Sampling streaming data with replacement.Compu- tational Statistics & Data Analysis, 52(2):750–762, 2007. doi:https://doi.org/10.1016/j.csda.2007.03.010

work page doi:10.1016/j.csda.2007.03.010 2007

[21] [21]

Sequential ran- dom sampling revisited: Hidden shuffle method

Michael Shekelyan and Graham Cormode. Sequential ran- dom sampling revisited: Hidden shuffle method. In Arindam Banerjee and Kenji Fukumizu, editors,Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 3628–3636. PMLR, 2021

work page 2021

[22] [22]

An asymptotically optimal, online algorithm for weighted random sampling with replacement

Michal Startek. An asymptotically optimal, online algorithm for weighted random sampling with replacement.CoRR, abs/1611.00532, 2016. arXiv:1611.00532

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

The big data ecosystem at linkedin

Roshan Sumbaly, Jay Kreps, and Sam Shah. The big data ecosystem at linkedin. InProceedings of the 2013 ACM SIG- MOD International Conference on Management of Data, SIGMOD ’13, page 1125–1134. Association for Computing Machinery, 2013. doi:10.1145/2463676.2463707

work page doi:10.1145/2463676.2463707 2013

[24] [24]

Jeffrey S. Vitter. Random sampling with a reservoir.ACM Transactions on Mathematical Software, 11(1):37–57, 1985. doi:10.1145/3147.3165

work page doi:10.1145/3147.3165 1985

[25] [25]

An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987

Jeffrey Scott Vitter. An efficient algorithm for sequential ran- dom sampling.ACM Transactions on Mathematical Software, 13(1):58–67, 1987. doi:10.1145/23002.23003

work page doi:10.1145/23002.23003 1987

[26] [26]

Efficient trans- mission and reconstruction of dependent data streams via edge sampling

Joel Wolfrath and Abhishek Chandra. Efficient trans- mission and reconstruction of dependent data streams via edge sampling. In2022 IEEE International Confer- ence on Cloud Engineering (IC2E), pages 47–57, 2022. doi:10.1109/IC2E55432.2022.00013. 4

work page doi:10.1109/ic2e55432.2022.00013 2022