Characterizing and Fixing Silent Data Loss in Spark-on-AWS-Lambda with Open Table Formats

Srujan Kumar Gandla

arxiv: 2604.20081 · v1 · submitted 2026-04-22 · 💻 cs.DC

Pith reviewed 2026-05-09 23:43 UTC · model grok-4.3

keywords Spark-on-AWS-Lambda · silent data loss · Delta Lake · Apache Iceberg · SIGKILL · watchdog thread · open table formats · rollback

The pith

SafeWriter prevents silent data loss from Lambda SIGKILLs in Spark table writes by forcing clean rollbacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Spark-on-AWS-Lambda writes to open table formats suffer silent data loss when a timeout SIGKILL lands in the gap between uploading data files and committing metadata. Through 860 controlled kill-injection experiments on Delta Lake and Apache Iceberg at three dataset sizes, it shows that the failure occurs in every trial where the signal arrives in that window. SafeWriter addresses the issue by wrapping the write with a watchdog thread that, 30 seconds before the timeout, triggers a format-native rollback and records a checkpoint document on S3, so that every tested failure is detectable and the table remains consistent.

Core claim

AWS Lambda's uncatchable SIGKILL on timeout, when landing between the data upload and metadata commit phases of a Spark write, leaves orphaned files on S3 while the table state is unchanged. SafeWriter, by arming a watchdog thread to trigger SQL rollbacks and record checkpoints, converts every such interruption into a clean, detectable rollback with negligible overhead on successful paths.
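The failure mode in the core claim can be illustrated with a toy model. The sketch below is not the paper's harness: `ToyTable`, `write`, and the `kill_in_gap` flag are hypothetical names, and an in-memory set stands in for S3. It only shows why a kill between the two phases leaves files in the store that no committed metadata references.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for an object store plus a table's commit log (hypothetical)."""
    objects: set = field(default_factory=set)    # every file present in the store
    committed: set = field(default_factory=set)  # files referenced by committed metadata

    def upload(self, files):
        # Phase 1: data files land in the store but are not yet visible to readers.
        self.objects |= set(files)

    def commit(self, files):
        # Phase 2: the metadata commit makes the uploaded files part of the table.
        self.committed |= set(files)

    def orphans(self):
        # Files that exist in the store but that no committed metadata points to.
        return self.objects - self.committed


def write(table, files, kill_in_gap=False):
    table.upload(files)
    if kill_in_gap:
        # Models the process disappearing: Phase 2 never runs, nothing is raised or logged.
        return
    table.commit(files)
```

After a normal `write`, `orphans()` is empty. After a `write` with `kill_in_gap=True`, the committed set is unchanged and the new file shows up only in `orphans()`, which is the state the paper describes as silent data loss.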

What carries the argument

SafeWriter, a language-level wrapper around Spark writes that uses a watchdog thread to force format-native rollbacks before Lambda termination.
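A minimal sketch of the watchdog idea follows, assuming a Python handler. It is not SafeWriter's code: `arm_watchdog`, `guarded_write`, and the `upload`, `commit`, and `rollback` callbacks are hypothetical, and the lock-based hand-off between the writer and the watchdog is a design choice made here for illustration. Only the 30-second margin comes from the paper.

```python
import threading


def arm_watchdog(remaining_ms, on_deadline, margin_s=30.0):
    """Start a timer that fires margin_s seconds before the remaining time runs out."""
    delay = max(0.0, remaining_ms / 1000.0 - margin_s)
    timer = threading.Timer(delay, on_deadline)
    timer.daemon = True  # never keep the process alive just for the watchdog
    timer.start()
    return timer


def guarded_write(remaining_ms, upload, commit, rollback, margin_s=30.0):
    """Run upload then commit, unless the watchdog fires first and rolls back."""
    lock = threading.Lock()
    state = {"value": "writing"}  # writing -> committed | rolled_back

    def on_deadline():
        # The lock makes commit and rollback mutually exclusive.
        with lock:
            if state["value"] == "writing":
                state["value"] = "rolled_back"
                rollback()

    timer = arm_watchdog(remaining_ms, on_deadline, margin_s)
    try:
        upload()  # Phase 1: data files
        with lock:
            if state["value"] == "writing":
                commit()  # Phase 2: metadata commit
                state["value"] = "committed"
    finally:
        timer.cancel()
    return state["value"]
```

In a Lambda handler, `remaining_ms` would typically come from `context.get_remaining_time_in_millis()`. The sketch shares the premise the review flags as load-bearing: it helps only if the timer thread is scheduled and the rollback finishes before the SIGKILL arrives.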

If this is right

All kill scenarios in the inter-phase gap result in clean rollbacks instead of silent loss.
The added latency on normal writes stays below 100 milliseconds.
Failures become detectable through the recorded checkpoint documents on S3.
The protection applies equally to Delta Lake and Apache Iceberg formats.
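The detectability point above depends on what the checkpoint document contains, which the abstract does not specify. The sketch below assumes a small JSON document; every field name (`table`, `format`, `status`, `rolled_back_to`, `recorded_at`) and both function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_checkpoint(table, table_format, rolled_back_to, now=None):
    """Serialize a rollback record (hypothetical schema, not taken from the paper)."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "table": table,
            "format": table_format,            # e.g. "delta" or "iceberg"
            "status": "ROLLED_BACK",
            "rolled_back_to": rolled_back_to,  # version or snapshot id restored
            "recorded_at": now.isoformat(),
        },
        sort_keys=True,
    )


def needs_retry(checkpoint_json):
    """A monitor can treat any ROLLED_BACK checkpoint as a write that must be re-run."""
    return json.loads(checkpoint_json).get("status") == "ROLLED_BACK"
```

A monitoring job could list such documents under a known S3 prefix and alert on any for which `needs_retry` is true, which is what turns a silent failure into a visible one.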

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar watchdog mechanisms could be adapted for other serverless compute platforms facing abrupt terminations.
Open table formats might incorporate built-in support for timeout-aware writes to reduce reliance on external wrappers.
Extending the experiments to include concurrent writes or larger scale could reveal additional edge cases in the characterization.

Load-bearing premise

That the watchdog thread will always have enough time to arm itself and complete the rollback before the uncatchable SIGKILL terminates the process, and that the rollback will always restore a consistent table state.

What would settle it

A single experiment where SafeWriter is active, a SIGKILL arrives in the inter-phase gap, and the outcome is still silent data loss with orphaned files and no rollback record.
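The falsifying outcome described above can be phrased as a small classifier over three observations made after a trial. This is an assumed rubric, not the paper's classification criteria; the function name and the four labels are hypothetical.

```python
def classify_outcome(committed, orphaned_files, checkpoint_found):
    """Label one kill-injection trial from what is observable afterwards (assumed rubric)."""
    if committed:
        return "committed"          # the metadata commit landed, data is visible
    if checkpoint_found:
        return "detected_rollback"  # write failed, but a rollback record exists
    if orphaned_files:
        return "silent_data_loss"   # files uploaded, no commit, no record
    return "no_effect"              # killed before any data file was uploaded
```

Under this rubric, one trial with SafeWriter active that returns `"silent_data_loss"` would be the counterexample.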

Figures

Figures reproduced from arXiv: 2604.20081 by Srujan Kumar Gandla.

Original abstract

AWS Lambda terminates containers with an uncatchable SIGKILL signal when a function exceeds its configured timeout. When a Spark-on-AWS-Lambda (SoAL) job is killed between Phase 1 (data upload) and Phase 2 (metadata commit) of a write, the result is silent data loss: orphaned Parquet files accumulate on S3 while the table's committed state remains unchanged and standard monitoring raises no alert. We characterize this vulnerability across Delta Lake and Apache Iceberg through 860 controlled kill-injection experiments at three dataset sizes. A SIGKILL landing in the inter-phase gap produced silent data loss in 100% of trials for both formats. We then present SafeWriter, a language-level wrapper that arms a watchdog thread 30 seconds before the Lambda timeout, triggers a format-native rollback via SQL, and records a checkpoint document on S3. SafeWriter converted every tested kill scenario into a clean, detectable rollback with under 100 ms added to normal write paths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

The paper shows that Spark-on-Lambda writes to Delta Lake and Iceberg lose data silently 100% of the time when killed between file upload and metadata commit, and offers a watchdog wrapper that converts those cases to rollbacks in their controlled tests.

The main thing to know is that this work identifies a reliable silent data loss path in serverless Spark jobs using open table formats. When AWS Lambda sends SIGKILL after the Parquet files are written to S3 but before the table metadata is committed, the result is orphaned files with no alert. Their 860 kill-injection experiments across three dataset sizes and both Delta Lake and Iceberg showed this happening every single time the kill landed in that inter-phase window. SafeWriter then adds a watchdog thread armed 30 seconds before timeout that triggers a native SQL rollback and writes a checkpoint; in their trials this produced clean, detectable rollbacks with under 100 ms overhead on normal paths. That characterization and the simple wrapper are the concrete contributions.

The experiments give a clear picture of how the vulnerability behaves under controlled conditions, which is useful for anyone running production lakehouse workloads on Lambda. The mitigation is a straightforward application of existing rollback ideas to this specific setting and keeps the added cost low.

The soft spots are around the mitigation's real-world reliability. The watchdog must complete its work before the uncatchable SIGKILL arrives, but JVM scheduling, GC pauses, or differences between their injection method and actual Lambda timeout behavior could let the loss slip through. The abstract gives no error bars, raw data, or full methods, so the 100% success rate is hard to evaluate for edge cases or consistency after rollback.

This is aimed at engineers who operate Spark on Lambda with Delta or Iceberg and need better visibility into partial writes. It has enough empirical grounding and a practical fix to deserve serious referee time, even if the timing assumptions need more scrutiny. I would send it for review and ask for details on the watchdog's execution guarantees and reproducibility artifacts.

Referee Report

2 major / 2 minor

Summary. The paper claims that Spark-on-AWS-Lambda jobs using Delta Lake or Apache Iceberg suffer silent data loss when an uncatchable SIGKILL arrives between Phase 1 (Parquet upload to S3) and Phase 2 (metadata commit), leaving orphaned files undetected by standard monitoring. Through 860 controlled kill-injection experiments across three dataset sizes, the author reports a 100% loss rate for both formats in the inter-phase gap. The paper introduces SafeWriter, a language-level wrapper that arms a watchdog thread 30 s before the Lambda timeout to issue a format-native SQL rollback and write a checkpoint document on S3, converting every tested kill into a clean, detectable rollback at <100 ms added latency.

Significance. If the empirical results and mitigation hold under production conditions, the work is significant for serverless data pipelines that rely on open table formats. The large-scale controlled characterization directly documents a previously under-reported failure mode, and the low-overhead engineering fix is immediately actionable. Explicit credit is due for the scale of the 860-experiment campaign and the reproducible, format-native rollback approach.

major comments (2)

Experiments section (abstract and §3): the central claim of consistent 100% silent data loss rests on 860 kill-injection trials, yet the manuscript provides neither full methods description, error bars, raw data, nor statistical analysis. This absence is load-bearing because the 100% figure cannot be independently verified from the available text.
SafeWriter design (§4): the claim that the watchdog thread 'converted every tested kill scenario into a clean, detectable rollback' assumes the thread can reliably execute the SQL rollback and checkpoint write before the uncatchable SIGKILL arrives. The paper's controlled injection method does not demonstrate that the same timing holds under actual AWS Lambda container termination (including JVM scheduling, GC pauses, or differences between injection and real timeout behavior), leaving the production reliability of the mitigation unproven.

minor comments (2)

Abstract: the '<100 ms added to normal write paths' figure is stated without specifying measurement methodology or whether it includes the S3 checkpoint write.
Overall: consider adding a reproducibility artifact (code, scripts, or raw logs) to allow readers to inspect the kill-injection harness and rollback logic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comments on experimental rigor and mitigation reliability. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: Experiments section (abstract and §3): the central claim of consistent 100% silent data loss rests on 860 kill-injection trials, yet the manuscript provides neither full methods description, error bars, raw data, nor statistical analysis. This absence is load-bearing because the 100% figure cannot be independently verified from the available text.

Authors: We agree that the current description of the experimental methods is insufficient for full reproducibility and independent verification. In the revised manuscript we will expand §3 with a detailed methods subsection covering the kill-injection protocol (exact timing relative to Phase 1/Phase 2 boundaries), Lambda configuration (timeout, memory, concurrency), Spark and open-table-format versions, dataset generation and sizes, and the precise criteria used to classify outcomes as silent data loss versus successful rollback. We will add a summary table of results stratified by format and dataset size. Because every one of the 860 trials produced the identical outcome, conventional error bars and statistical tests are not applicable; we will explicitly note the deterministic character of the failure mode under the tested conditions. Raw logs and scripts will be released as supplementary material through the project repository. These changes directly address the verifiability concern.

Revision: yes

Referee: SafeWriter design (§4): the claim that the watchdog thread 'converted every tested kill scenario into a clean, detectable rollback' assumes the thread can reliably execute the SQL rollback and checkpoint write before the uncatchable SIGKILL arrives. The paper's controlled injection method does not demonstrate that the same timing holds under actual AWS Lambda container termination (including JVM scheduling, GC pauses, or differences between injection and real timeout behavior), leaving the production reliability of the mitigation unproven.

Authors: We acknowledge that our controlled SIGKILL injection, while timed to mimic the inter-phase window, cannot fully replicate all runtime behaviors of real AWS Lambda container termination (e.g., JVM scheduling jitter or GC pauses). In the revised manuscript we will add a dedicated limitations paragraph in §4 that explicitly discusses these differences, the 30-second safety margin chosen for arming the watchdog, and the measured execution latency of the rollback path, reported alongside the <100 ms overhead already measured on normal write paths. We will also report additional timing traces from the test harness showing the margin between watchdog trigger and kill. Demonstrating exact equivalence to production termination semantics would require internal AWS instrumentation that is unavailable to us; we therefore cannot claim production-proof reliability. The revision will clarify that SafeWriter converts every tested kill into a detectable rollback while noting the remaining gap between controlled and live environments.

Revision: partial
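On the first response above: even when every trial gives the same outcome, an exact one-sided lower confidence bound on the underlying rate can still be reported. For n trials that all succeed, the Clopper-Pearson lower bound at confidence 1 - α reduces to α^(1/n). The function name below is hypothetical, and n must be the number of trials in which the kill actually landed in the inter-phase gap, a count the abstract does not break out from the 860 total.

```python
def lower_bound_all_successes(n, confidence=0.95):
    """One-sided exact lower bound on the true rate when all n trials succeed."""
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Solve p**n = alpha for p: the smallest rate still consistent with n out of n.
    return alpha ** (1.0 / n)
```

For an illustrative n of 100 (not a figure from the paper), the bound is about 0.97, so "100% of trials" would translate to "at least roughly 97% with 95% confidence". The bound tightens as n grows.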

Circularity Check

0 steps flagged

No circularity: empirical characterization plus engineering fix with no derivations or self-referential predictions

The paper reports direct experimental results from 860 controlled kill-injection trials demonstrating 100% silent data loss when SIGKILL occurs between data upload and metadata commit phases for both Delta Lake and Iceberg. It then describes SafeWriter, a practical wrapper that arms a watchdog thread to trigger format-native rollback and checkpointing. No equations, fitted parameters, predictions that reduce to prior fits, uniqueness theorems, or self-citation chains are present. The central claims rest on observable outcomes under the described test conditions rather than any derivation that collapses to its own inputs by construction. This is a self-contained empirical and engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about Lambda container termination behavior and the atomicity properties of open table formats; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)

Domain assumption: AWS Lambda delivers an uncatchable SIGKILL on timeout, which user code cannot intercept.
Invoked to explain why standard exception handling fails.
Domain assumption: Open table formats (Delta Lake, Iceberg) support native rollback operations via SQL that can be triggered after partial writes.
Required for SafeWriter to convert loss into clean rollback.
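For the second assumption, both formats expose a rollback through Spark SQL: Delta Lake has `RESTORE TABLE ... TO VERSION AS OF`, and Iceberg has the `rollback_to_snapshot` stored procedure. Whether SafeWriter issues exactly these statements is an assumption; the abstract says only "format-native rollback via SQL". The helper `rollback_sql` and its parameters are hypothetical.

```python
def rollback_sql(table_format, table, target, catalog="spark_catalog"):
    """Build a format-native rollback statement for Spark SQL (illustrative helper).

    target is a Delta table version or an Iceberg snapshot id captured before the write.
    """
    target = int(target)  # reject anything that is not a plain integer id
    if table_format == "delta":
        return f"RESTORE TABLE {table} TO VERSION AS OF {target}"
    if table_format == "iceberg":
        return f"CALL {catalog}.system.rollback_to_snapshot('{table}', {target})"
    raise ValueError(f"unsupported table format: {table_format!r}")
```

The resulting string would be passed to `spark.sql(...)` by the watchdog. The target version or snapshot id has to be read before Phase 1 starts, since that is the state the table should return to.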

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] AWS Samples, "Spark on AWS Lambda," https://github.com/aws-samples/spark-on-aws-lambda, 2022. Open-source project for running Apache Spark inside AWS Lambda containers.

[2] M. Armbrust, T. Das, L. Sun, B. Yavuz, S. Zhu, M. Murthy, J. Torres, H. van Hovell, A. Ionescu, A. Łukacs et al., "Delta Lake: High-performance ACID table storage over cloud object stores," in Proceedings of the VLDB Endowment, vol. 13, no. 12. VLDB Endowment, 2020, pp. 3411–3424.

[3] R. Kinley and D. Blue, "Apache Iceberg: An open table format for huge analytic datasets," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. ACM, 2020, pp. 2751–2753.

[4] A. Sivachenko, S. Samineni et al., "Apache Hudi: The data lake platform," in Proceedings of the VLDB Endowment, vol. 14, no. 12. VLDB Endowment, 2021.

[5] J. Camacho-Rodríguez, A. Agrawal, A. Gruenheid, A. Gosalia, C. Petculescu, J. Aguilar-Saborit, A. Floratou, C. Curino, and R. Ramakrishnan, "LST-Bench: Benchmarking log-structured tables in the cloud," Proceedings of the ACM on Management of Data, vol. 2, no. 1, 2024.

[6] Amazon Web Services, "Amazon S3 strong consistency," https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/, 2020. AWS announcement of strong read-after-write consistency for S3, December 2020.

[7] J. Gray, Notes on Data Base Operating Systems. Springer, 1978.

[8] Amazon Web Services, "AWS Lambda FAQs," https://aws.amazon.com/lambda/faqs/, 2023, accessed 2026.

[9] Amazon Web Services, "Spatial data: Accommodations dataset," https://docs.aws.amazon.com/redshift/latest/dg/spatial-tutorial.html, 2022. Used as benchmark dataset in AWS Redshift spatial tutorial.

[10] V. Sreekanti, C. Wu, X. C. Lin, J. Schleier-Smith, J. M. Gonzalez, J. M. Hellerstein, and A. Tumanov, "Cloudburst: Stateful functions-as-a-service," in Proceedings of the VLDB Endowment, vol. 13, no. 12. VLDB Endowment, 2020, pp. 2438–2452.

[11] A. Klimovic, Y. Wang, C. Kozyrakis, P. Stuedi, J. Pfefferle, and A. Trivedi, "Understanding ephemeral storage for serverless analytics," in Proceedings of the 2018 USENIX Annual Technical Conference (ATC). USENIX, 2018, pp. 789–794.

[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX, 2012, pp. 15–28.

[13] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63–75, 1985.
