BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams

Adriano Vogel; Otmar Ertl; S\"oren Henning

arxiv: 2606.03364 · v1 · pith:BGY5TZN5new · submitted 2026-06-02 · 💻 cs.DC · cs.DB· cs.PF· cs.SE

BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams

S\"oren Henning , Otmar Ertl , Adriano Vogel This is my paper

Pith reviewed 2026-06-28 08:29 UTC · model grok-4.3

classification 💻 cs.DC cs.DBcs.PFcs.SE

keywords stream processingdata shufflingrepartitioningcloud object storageKafka Streamscost efficiencydistributed caching

0 comments

The pith

BlobShuffle routes stream shuffling through cloud object storage to cut costs by more than 40 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BlobShuffle as a method to perform data shuffling in stream processing by storing batches of records in cloud object storage and sending only notifications. This replaces direct network transfers that incur high costs from cross-availability zone traffic. If the approach works, stream processing frameworks can support stateful workloads more affordably while maintaining low latency and correctness. Readers should care because current shuffling operations are a major expense and operational challenge in cloud deployments. The evaluation shows scalability to over 2 GiB/s with latency under 2 seconds at the 95th percentile.

Core claim

BlobShuffle groups records into batches, stores them in cloud object storage, and forwards compact notifications to downstream operators who retrieve the batches using distributed caching. This yields more than 40x lower shuffling costs than native Kafka Streams while keeping 95th percentile latency below 2 seconds and scaling to more than 2 GiB/s.

What carries the argument

Batching records for storage in cloud object storage combined with notification forwarding and distributed caching to balance cost and latency.

If this is right

Shuffling no longer requires operating a high-throughput messaging backbone.
Existing Kafka Streams applications need only minimal code changes.
Cost efficiency allows shuffle-intensive workloads at large scale.
Consistency and correctness guarantees remain intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could extend to other cloud-based stream processors facing similar repartitioning costs.
It might encourage greater use of object storage as an exchange layer in distributed systems.
Testing at even higher throughputs could reveal if scalability limits appear beyond 2 GiB/s.

Load-bearing premise

The assumption that using object storage for batches and notifications maintains the original system's consistency and correctness without adding failure modes or exceeding latency targets.

What would settle it

An experiment measuring shuffling costs and 95th percentile latency at scale that fails to show at least 40x cost reduction or exceeds 2 seconds latency.

Figures

Figures reproduced from arXiv: 2606.03364 by Adriano Vogel, Otmar Ertl, S\"oren Henning.

**Figure 2.** Figure 2: Overview of BlobShuffle’s Batcher operator. Incoming records are grouped into batches, uploaded to the cloud object [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of BlobShuffle’s Debatcher operator. Received notifications reference batches, which are retrieved from [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Caching mechanisms in BlobShuffle’s read path with and without an additional local cache layer. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Latency distributions for BlobShuffle (24 Kafka Streams instances on 12 nodes with 16 MiB batches). [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Impact of the target batch size on performance and costs for BlobShuffle (24 Kafka Streams instances on 12 nodes). [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Latency of shuffling via BlobShuffle for different [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Impact of scaling the number of partitions on performance for BlobShuffle (12 nodes, 16 MiB batches). [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Impact of scaling the cluster size on performance for BlobShuffle (16 MiB batches). [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

read the original abstract

Shuffling or repartitioning data streams is an essential operation of state-of-the-art stream processing frameworks to support stateful workloads in a large-scale, distributed setting. In today's cloud deployments, however, shuffling can become a major cost driver due to substantial network traffic across multiple availability zones (AZs) as well as an operational burden when operating a high-throughput, strongly consistent messaging backbone at scale. We present BlobShuffle, a novel approach to cost-effective shuffling for stream processing systems that leverages cloud object storage as an intermediate exchange layer. Instead of sending all shuffled records directly, BlobShuffle groups records into batches, stores these batches in cloud object storage, and forwards only compact notifications. Downstream operators use these notifications to retrieve the relevant batches and extract the corresponding records. BlobShuffle balances cost efficiency and latency through configurable batching and a distributed caching mechanism. BlobShuffle is implemented as an add-on for Kafka Streams that requires only minimal code changes to existing applications, leaves Kafka and the underlying infrastructure unmodified, and preserves Kafka Streams' consistency and correctness guarantees. In a large-scale experimental evaluation on a Kubernetes-based AWS deployment, we show that BlobShuffle can reduce shuffling costs by more than 40x compared to native Kafka Streams shuffling while keeping the 95th percentile shuffle latency below 2 seconds. Moreover, it scales to processing more than 2 GiB/s without encountering a scalability limit in our experiments, indicating that BlobShuffle can economically support shuffle-intensive workloads at large scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BlobShuffle gives a practical object-storage route for Kafka Streams shuffling that cuts costs sharply in the reported experiments, though the consistency story under failures needs tighter evidence.

read the letter

BlobShuffle routes shuffled records through cloud object storage batches plus notifications instead of direct network or Kafka traffic. The add-on keeps changes to existing apps small and leaves the messaging layer untouched.

The experiments are the clearest strength. On Kubernetes with AWS they report more than 40x lower shuffling cost, p95 latency still under 2 seconds, and throughput above 2 GiB/s with no scalability wall hit. Those numbers address a real expense in cross-AZ cloud deployments.

The soft spot is the preservation of correctness. The design adds upload steps, notification delivery, and a distributed cache on top of eventually consistent storage. The abstract states that guarantees stay the same and no new failure modes appear, but the handling of partial uploads, lost notifications, or cache misses is not shown in enough detail to confirm equivalence with native repartitioning. A few more failure-injection cases or pseudocode for the recovery paths would close that gap.

The work is aimed at practitioners running large stateful Kafka Streams jobs where network bills dominate and at systems researchers looking for concrete cost-latency trade-offs. It is worth a serious referee because the implementation is real, the measurements are concrete, and the central engineering claim is testable even if the robustness section needs expansion.

Referee Report

1 major / 0 minor

Summary. The paper presents BlobShuffle, an add-on for Kafka Streams that uses cloud object storage to batch records for shuffling/repartitioning, stores batches in object storage, and forwards only compact notifications to downstream operators (who retrieve via a distributed cache). It claims this yields >40x lower shuffling costs than native Kafka Streams, p95 shuffle latency <2s, and throughput >2 GiB/s on a Kubernetes AWS deployment, while requiring only minimal application changes, leaving Kafka unmodified, and preserving consistency/correctness guarantees.

Significance. If the experimental outcomes and correctness preservation hold, the work addresses a real cost and operational pain point in cloud stream processing by substituting cross-AZ network traffic with object-storage I/O plus notifications. The concrete 40x cost figure, latency bound, and scale result (if reproducible) would be a useful data point for practitioners; the design choice of configurable batching plus caching is a pragmatic way to trade latency for cost.

major comments (1)

[Abstract] Abstract and evaluation section: the central claim that the batch-upload/notification/cache design 'preserves Kafka Streams' consistency and correctness guarantees' without new failure modes is load-bearing for the 'no Kafka changes' benefit, yet the provided description gives no concrete failure-handling protocol, consistency model argument, or fault-injection results for upload failures, notification loss, or cache-coherence races; this leaves the 40x cost and latency claims without a verified correctness foundation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for underscoring the need for explicit correctness arguments. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract and evaluation section: the central claim that the batch-upload/notification/cache design 'preserves Kafka Streams' consistency and correctness guarantees' without new failure modes is load-bearing for the 'no Kafka changes' benefit, yet the provided description gives no concrete failure-handling protocol, consistency model argument, or fault-injection results for upload failures, notification loss, or cache-coherence races; this leaves the 40x cost and latency claims without a verified correctness foundation.

Authors: We agree that the current manuscript provides only a high-level claim without a concrete failure-handling protocol or supporting arguments. The design intends to inherit Kafka's exactly-once delivery for notifications and to treat object-storage uploads as idempotent operations with client-side retries, but these mechanisms are not spelled out. We will therefore add a dedicated subsection (approximately one page) in the system-design section that (1) enumerates the failure cases, (2) describes the retry and compensation logic, (3) argues why no new consistency violations are introduced relative to native Kafka Streams, and (4) reports the results of targeted fault-injection experiments. Corresponding text will also be added to the abstract and evaluation summary. These changes will appear in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct experimental measurements

full rationale

The paper presents BlobShuffle as an implemented system evaluated via large-scale experiments on AWS/Kubernetes. The core claims (40x cost reduction, p95 latency <2s, >2 GiB/s throughput) are reported as measured outcomes, not derived from equations, fitted parameters, or self-referential definitions. No mathematical derivation chain, uniqueness theorems, or ansatzes appear in the abstract or described approach. The consistency-preservation argument is an engineering claim supported by the implementation description rather than a reduction to prior self-citations. This matches the expected non-circular case for an empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions about cloud object storage durability and stream-processing semantics; no free parameters or invented entities are explicitly introduced beyond configuration choices.

axioms (1)

domain assumption Cloud object storage provides sufficient durability, availability, and consistency for use as an intermediate exchange layer in stream processing.
Implicit in the design that relies on object storage without additional verification mechanisms mentioned.

pith-pipeline@v0.9.1-grok · 5813 in / 1182 out tokens · 33483 ms · 2026-06-28T08:29:27.497938+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 23 canonical work pages

[1]

Aiven. 2026. Diskless for Apache Kafka ® (BYOC). https://aiven.io/inkless

2026
[2]

Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Pet...

work page doi:10.14778/3415478.3415560 2020
[3]

AutoMQ. 2026. AutoMQ | Kafka ® Compatible Cloud-Native Data Streaming Platform. https://www.automq.com/

2026
[4]

Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, and Tim Kraska. 2008. Building a database on S3. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, Canada) (SIGMOD ’08). Association for Computing Machinery, New York, NY, USA, 251–264. https: //doi.org/10.1145/1376616.1376645

work page doi:10.1145/1376616.1376645 2008
[5]

Confluent. 2026. WarpStream – The Diskless, Kafka-Compatible Data Streaming Platform. https://www.warpstream.com/

2026
[6]

Lee, Ashish Motivala, Abdul Q

Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Procee...

work page doi:10.1145/2882903.2903741 2016
[7]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Pro- cessing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113. https: //doi.org/10.1145/1327452.1327492

work page doi:10.1145/1327452.1327492 2008
[8]

Dominik Durner, Viktor Leis, and Thomas Neumann. 2023. Exploiting Cloud Object Storage for High-Performance Analytics. Proc. VLDB Endow. 16, 11 (July 2023), 2769–2782. https://doi.org/10.14778/3611479.3611486

work page doi:10.14778/3611479.3611486 2023
[9]

Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri, and Asterios Katsifodimos
[10]

https://doi.org/10.1007/s00778-023-00819-8

A survey on the evolution of stream processing systems.The VLDB Journal 33, 2 (2024), 507–541. https://doi.org/10.1007/s00778-023-00819-8

work page doi:10.1007/s00778-023-00819-8 2024
[11]

Greg Harris, Ivan Yurchenko, Jorge Quilcate, Giuseppe Lillo, Anatolii Popov, Juha Mynttinen, Josep Prat, and Filip Yonov. 2025. Kafka Improvement Proposals: KIP- 1150: Diskless Topics. https://cwiki.apache.org/confluence/display/KAFKA/KIP- 1150%3A+Diskless+Topics

2025
[12]

Michael Haubenschild and Viktor Leis. 2025. OLTP in the Cloud: Architectures, Tradeoffs, and Cost. The VLDB Journal 34, 4 (2025), 42. https://doi.org/10.1007/ s00778-025-00913-z

2025
[13]

Sören Henning and Wilhelm Hasselbring. 2021. How to Measure Scalability of Distributed Stream Processing Engines?. In Companion of the ACM/SPEC International Conference on Performance Engineering . ACM. https://doi.org/10. 1145/3447545.3451190

arXiv 2021
[14]

Sören Henning and Wilhelm Hasselbring. 2022. A Configurable Method for Benchmarking Scalability of Cloud-Native Applications. Empirical Software Engineering 27, 6 (Aug. 2022). https://doi.org/10.1007/s10664-022-10162-1

work page doi:10.1007/s10664-022-10162-1 2022
[15]

Sören Henning and Wilhelm Hasselbring. 2024. Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud.Journal of Systems and Software 208 (2024), 111879. https://doi.org/10.1016/j.jss.2023.111879

work page doi:10.1016/j.jss.2023.111879 2024
[16]

Sören Henning, Adriano Vogel, and Otmar Ertl. 2026. BlobShuffle. https: //github.com/dynatrace-research/BlobShuffle

2026
[17]

Sören Henning, Adriano Vogel, Michael Leichtfried, Otmar Ertl, and Rick Rabiser
[18]

InProceedings of the 15th ACM/SPEC International Conference on Performance Engineering (London, United Kingdom) (ICPE ’24)

ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks. InProceedings of the 15th ACM/SPEC International Conference on Performance Engineering (London, United Kingdom) (ICPE ’24). ACM, 2–13. https://doi.org/10.1145/3629526.3645036

work page doi:10.1145/3629526.3645036
[19]

Sören Henning, Adriano Vogel, Esteban Perez-Wohlfeil, Otmar Ertl, and Rick Rabiser. 2025. When Should I Run My Application Benchmark? Studying Cloud Performance Variability for the Case of Stream Processing Applications. In Proceedings of the 33rd ACM International Conference on the Foundations of Soft- ware Engineering (Clarion Hotel Trondheim, Trondheim...

work page doi:10.1145/3696630.3728563 2025
[20]

Rodrigo Laigner, Ana Carolina Almeida, Wesley K. G. Assunção, and Yongluan Zhou. 2025. An Empirical Study on Challenges of Event Management in Microservice Architectures. ACM Trans. Softw. Eng. Methodol. (Dec. 2025). https://doi.org/10.1145/3776581

work page doi:10.1145/3776581 2025
[21]

Frank Sifei Luan, Stephanie Wang, Samyukta Yagati, Sean Kim, Kenneth Lien, Isaac Ong, Tony Hong, Sangbin Cho, Eric Liang, and Ion Stoica. 2023. Exoshuffle: An Extensible Shuffle Architecture. In Proceedings of the ACM SIGCOMM 2023 Conference (New York, NY, USA) (ACM SIGCOMM ’23). Association for Comput- ing Machinery, New York, NY, USA, 564–577. https://d...

work page doi:10.1145/3603269 2023
[22]

Yuan Mei, Rui Xia, Zhaoqian Lan, Kaitian Hu, Lei Huang, Paris Carbone, Yanfei Lei, Vasiliki Kalavri, Han Yin, and Feng Wang. 2025. Disaggregated State Man- agement in Apache Flink® 2.0. Proc. VLDB Endow. 18, 12 (Aug. 2025), 4846–4859. https://doi.org/10.14778/3750601.3750609

work page doi:10.14778/3750601.3750609 2025
[23]

Matteo Merli, Sijie Guo, Penghui Li, Hang Chen, and Neng Lu. 2025. Ursa: A Lakehouse-Native Data Streaming Engine for Kafka. Proc. VLDB Endow. 18, 12 (Aug. 2025), 5184–5196. https://doi.org/10.14778/3750601.3750636

work page doi:10.14778/3750601.3750636 2025
[24]

Nikolaos Nikitas, Ioannis Konstantinou, Vana Kalogeraki, and Nectarios Koziris
[25]

In 2021 IEEE International Conference on Big Data (Big Data)

Cherry: A Distributed Task-Aware Shuffle Service for Serverless Analytics. In 2021 IEEE International Conference on Big Data (Big Data) . 120–130. https: //doi.org/10.1109/BigData52589.2021.9671899

work page doi:10.1109/bigdata52589.2021.9671899 2021
[26]

Tobias Pfandzelter, Sören Henning, Trever Schirmer, Wilhelm Hasselbring, and David Bermbach. 2022. Streaming vs. Functions: A Cost Perspective on Cloud Event Processing. In 2022 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 67–78. https://doi.org/10.1109/IC2E55432.2022.00015

work page doi:10.1109/ic2e55432.2022.00015 2022
[27]

Sax, Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag

Matthias J. Sax, Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag. 2018. Streams and Tables: Two Sides of the Same Coin. In Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics (BIRTE ’18). ACM, 10. https://doi.org/10.1145/3242153.3242155

work page doi:10.1145/3242153.3242155 2018
[28]

Min Shen, Ye Zhou, and Chandni Singh. 2020. Magnet: push-based shuffle service for large-scale data processing. Proc. VLDB Endow. 13, 12 (Aug. 2020), 3382–3395. https://doi.org/10.14778/3415478.3415558

work page doi:10.14778/3415478.3415558 2020
[29]

Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data (Chic...

work page doi:10.1145/3035918.3056101 2017
[30]

Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, and Rick Rabiser. 2024. A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks. In Proceedings of the 18th ACM International Conference on Distributed and Event-Based Systems (Villeurbanne, France) (DEBS ’24). ACM, 171–182. https://doi.org/10.1145/3629104.3666040

work page doi:10.1145/3629104.3666040 2024
[31]

Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao

Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. 2021. Consistency and Completeness: Re- thinking Distributed Stream Processing in Apache Kafka. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD/...

work page doi:10.1145/3448016.3457556 2021
[32]

Wang Yue, Martin Boissier, and Tilmann Rabl. 2024. A Survey of Stream Process- ing System Benchmarks. In Performance Evaluation and Benchmarking: 16th TPC Technology Conference, TPCTC 2024, Guangzhou, China, August 30, 2024, Revised Selected Papers (Guangzhou, China). Springer-Verlag, Berlin, Heidelberg, 24–43. https://doi.org/10.1007/978-3-031-93858-0_2

work page doi:10.1007/978-3-031-93858-0_2 2024
[33]

Freedman

Haoyu Zhang, Brian Cho, Ergin Seyfe, Avery Ching, and Michael J. Freedman
[34]

InProceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys ’18)

Riffle: optimized shuffle service for large-scale data analytics. InProceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys ’18). Association for Computing Machinery, New York, NY, USA, Article 43, 15 pages. https: //doi.org/10.1145/3190508.3190534

work page doi:10.1145/3190508.3190534

[1] [1]

Aiven. 2026. Diskless for Apache Kafka ® (BYOC). https://aiven.io/inkless

2026

[2] [2]

Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Pet...

work page doi:10.14778/3415478.3415560 2020

[3] [3]

AutoMQ. 2026. AutoMQ | Kafka ® Compatible Cloud-Native Data Streaming Platform. https://www.automq.com/

2026

[4] [4]

Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, and Tim Kraska. 2008. Building a database on S3. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (Vancouver, Canada) (SIGMOD ’08). Association for Computing Machinery, New York, NY, USA, 251–264. https: //doi.org/10.1145/1376616.1376645

work page doi:10.1145/1376616.1376645 2008

[5] [5]

Confluent. 2026. WarpStream – The Diskless, Kafka-Compatible Data Streaming Platform. https://www.warpstream.com/

2026

[6] [6]

Lee, Ashish Motivala, Abdul Q

Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. 2016. The Snowflake Elastic Data Warehouse. In Procee...

work page doi:10.1145/2882903.2903741 2016

[7] [7]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Pro- cessing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113. https: //doi.org/10.1145/1327452.1327492

work page doi:10.1145/1327452.1327492 2008

[8] [8]

Dominik Durner, Viktor Leis, and Thomas Neumann. 2023. Exploiting Cloud Object Storage for High-Performance Analytics. Proc. VLDB Endow. 16, 11 (July 2023), 2769–2782. https://doi.org/10.14778/3611479.3611486

work page doi:10.14778/3611479.3611486 2023

[9] [9]

Marios Fragkoulis, Paris Carbone, Vasiliki Kalavri, and Asterios Katsifodimos

[10] [10]

https://doi.org/10.1007/s00778-023-00819-8

A survey on the evolution of stream processing systems.The VLDB Journal 33, 2 (2024), 507–541. https://doi.org/10.1007/s00778-023-00819-8

work page doi:10.1007/s00778-023-00819-8 2024

[11] [11]

Greg Harris, Ivan Yurchenko, Jorge Quilcate, Giuseppe Lillo, Anatolii Popov, Juha Mynttinen, Josep Prat, and Filip Yonov. 2025. Kafka Improvement Proposals: KIP- 1150: Diskless Topics. https://cwiki.apache.org/confluence/display/KAFKA/KIP- 1150%3A+Diskless+Topics

2025

[12] [12]

Michael Haubenschild and Viktor Leis. 2025. OLTP in the Cloud: Architectures, Tradeoffs, and Cost. The VLDB Journal 34, 4 (2025), 42. https://doi.org/10.1007/ s00778-025-00913-z

2025

[13] [13]

Sören Henning and Wilhelm Hasselbring. 2021. How to Measure Scalability of Distributed Stream Processing Engines?. In Companion of the ACM/SPEC International Conference on Performance Engineering . ACM. https://doi.org/10. 1145/3447545.3451190

arXiv 2021

[14] [14]

Sören Henning and Wilhelm Hasselbring. 2022. A Configurable Method for Benchmarking Scalability of Cloud-Native Applications. Empirical Software Engineering 27, 6 (Aug. 2022). https://doi.org/10.1007/s10664-022-10162-1

work page doi:10.1007/s10664-022-10162-1 2022

[15] [15]

Sören Henning and Wilhelm Hasselbring. 2024. Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud.Journal of Systems and Software 208 (2024), 111879. https://doi.org/10.1016/j.jss.2023.111879

work page doi:10.1016/j.jss.2023.111879 2024

[16] [16]

Sören Henning, Adriano Vogel, and Otmar Ertl. 2026. BlobShuffle. https: //github.com/dynatrace-research/BlobShuffle

2026

[17] [17]

Sören Henning, Adriano Vogel, Michael Leichtfried, Otmar Ertl, and Rick Rabiser

[18] [18]

InProceedings of the 15th ACM/SPEC International Conference on Performance Engineering (London, United Kingdom) (ICPE ’24)

ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks. InProceedings of the 15th ACM/SPEC International Conference on Performance Engineering (London, United Kingdom) (ICPE ’24). ACM, 2–13. https://doi.org/10.1145/3629526.3645036

work page doi:10.1145/3629526.3645036

[19] [19]

Sören Henning, Adriano Vogel, Esteban Perez-Wohlfeil, Otmar Ertl, and Rick Rabiser. 2025. When Should I Run My Application Benchmark? Studying Cloud Performance Variability for the Case of Stream Processing Applications. In Proceedings of the 33rd ACM International Conference on the Foundations of Soft- ware Engineering (Clarion Hotel Trondheim, Trondheim...

work page doi:10.1145/3696630.3728563 2025

[20] [20]

Rodrigo Laigner, Ana Carolina Almeida, Wesley K. G. Assunção, and Yongluan Zhou. 2025. An Empirical Study on Challenges of Event Management in Microservice Architectures. ACM Trans. Softw. Eng. Methodol. (Dec. 2025). https://doi.org/10.1145/3776581

work page doi:10.1145/3776581 2025

[21] [21]

Frank Sifei Luan, Stephanie Wang, Samyukta Yagati, Sean Kim, Kenneth Lien, Isaac Ong, Tony Hong, Sangbin Cho, Eric Liang, and Ion Stoica. 2023. Exoshuffle: An Extensible Shuffle Architecture. In Proceedings of the ACM SIGCOMM 2023 Conference (New York, NY, USA) (ACM SIGCOMM ’23). Association for Comput- ing Machinery, New York, NY, USA, 564–577. https://d...

work page doi:10.1145/3603269 2023

[22] [22]

Yuan Mei, Rui Xia, Zhaoqian Lan, Kaitian Hu, Lei Huang, Paris Carbone, Yanfei Lei, Vasiliki Kalavri, Han Yin, and Feng Wang. 2025. Disaggregated State Man- agement in Apache Flink® 2.0. Proc. VLDB Endow. 18, 12 (Aug. 2025), 4846–4859. https://doi.org/10.14778/3750601.3750609

work page doi:10.14778/3750601.3750609 2025

[23] [23]

Matteo Merli, Sijie Guo, Penghui Li, Hang Chen, and Neng Lu. 2025. Ursa: A Lakehouse-Native Data Streaming Engine for Kafka. Proc. VLDB Endow. 18, 12 (Aug. 2025), 5184–5196. https://doi.org/10.14778/3750601.3750636

work page doi:10.14778/3750601.3750636 2025

[24] [24]

Nikolaos Nikitas, Ioannis Konstantinou, Vana Kalogeraki, and Nectarios Koziris

[25] [25]

In 2021 IEEE International Conference on Big Data (Big Data)

Cherry: A Distributed Task-Aware Shuffle Service for Serverless Analytics. In 2021 IEEE International Conference on Big Data (Big Data) . 120–130. https: //doi.org/10.1109/BigData52589.2021.9671899

work page doi:10.1109/bigdata52589.2021.9671899 2021

[26] [26]

Tobias Pfandzelter, Sören Henning, Trever Schirmer, Wilhelm Hasselbring, and David Bermbach. 2022. Streaming vs. Functions: A Cost Perspective on Cloud Event Processing. In 2022 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 67–78. https://doi.org/10.1109/IC2E55432.2022.00015

work page doi:10.1109/ic2e55432.2022.00015 2022

[27] [27]

Sax, Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag

Matthias J. Sax, Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag. 2018. Streams and Tables: Two Sides of the Same Coin. In Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics (BIRTE ’18). ACM, 10. https://doi.org/10.1145/3242153.3242155

work page doi:10.1145/3242153.3242155 2018

[28] [28]

Min Shen, Ye Zhou, and Chandni Singh. 2020. Magnet: push-based shuffle service for large-scale data processing. Proc. VLDB Endow. 13, 12 (Aug. 2020), 3382–3395. https://doi.org/10.14778/3415478.3415558

work page doi:10.14778/3415478.3415558 2020

[29] [29]

Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data (Chic...

work page doi:10.1145/3035918.3056101 2017

[30] [30]

Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, and Rick Rabiser. 2024. A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks. In Proceedings of the 18th ACM International Conference on Distributed and Event-Based Systems (Villeurbanne, France) (DEBS ’24). ACM, 171–182. https://doi.org/10.1145/3629104.3666040

work page doi:10.1145/3629104.3666040 2024

[31] [31]

Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao

Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. 2021. Consistency and Completeness: Re- thinking Distributed Stream Processing in Apache Kafka. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD/...

work page doi:10.1145/3448016.3457556 2021

[32] [32]

Wang Yue, Martin Boissier, and Tilmann Rabl. 2024. A Survey of Stream Process- ing System Benchmarks. In Performance Evaluation and Benchmarking: 16th TPC Technology Conference, TPCTC 2024, Guangzhou, China, August 30, 2024, Revised Selected Papers (Guangzhou, China). Springer-Verlag, Berlin, Heidelberg, 24–43. https://doi.org/10.1007/978-3-031-93858-0_2

work page doi:10.1007/978-3-031-93858-0_2 2024

[33] [33]

Freedman

Haoyu Zhang, Brian Cho, Ergin Seyfe, Avery Ching, and Michael J. Freedman

[34] [34]

InProceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys ’18)

Riffle: optimized shuffle service for large-scale data analytics. InProceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys ’18). Association for Computing Machinery, New York, NY, USA, Article 43, 15 pages. https: //doi.org/10.1145/3190508.3190534

work page doi:10.1145/3190508.3190534