arxiv: 2603.29013 · v2 · submitted 2026-03-30 · 💻 cs.SE · cs.DC

Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos

Pith reviewed 2026-05-14 20:57 UTC · model grok-4.3

classification 💻 cs.SE cs.DC
keywords online debugging · distributed systems · bug provenance · static analysis · instrumentation · root cause analysis · low overhead

The pith

Lumos automatically captures bug provenance in distributed systems using static analysis to guide lightweight recording.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lumos, a framework for online debugging of distributed systems that focuses on exposing the computational history linking bug symptoms to root causes. It does this by using static analysis to identify relevant program states and then performing on-demand recording only when necessary. This approach aims to give developers sufficient evidence to diagnose issues after only a few bug occurrences while keeping runtime overhead low. A sympathetic reader would care because manual evidence collection in production is time-consuming and error-prone, especially for non-deterministic bugs.

Core claim

Lumos uses dependency-guided instrumentation, powered by static analysis, to identify the program state related to a bug's provenance, and exposes that state through lightweight on-demand recording. This gives developers enough evidence to identify a bug's root cause after only a few occurrences of the bug, at low runtime overhead.

What carries the argument

Dependency-guided instrumentation powered by static analysis, which selects program state for lightweight on-demand recording of bug provenances.
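To make the mechanism concrete, here is a minimal sketch of dependency-guided selection. It assumes the static analysis has already produced a map from each program point to the points it depends on (through data or control dependencies), and it walks that map backwards from the symptom. The graph, the node names, and `backward_slice` are hypothetical illustrations; the paper's actual analysis is not shown on this page.

```python
from collections import deque


def backward_slice(deps, symptom):
    """Return every node that `symptom` transitively depends on, plus itself.

    `deps` maps a node to the set of nodes it directly depends on.
    """
    seen = {symptom}
    queue = deque([symptom])
    while queue:
        node = queue.popleft()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical dependency map; none of these names come from the paper.
DEPS = {
    "symptom_check": {"queue_len", "config_flag"},
    "queue_len": {"request_id"},
    "config_flag": set(),
    "unrelated_counter": {"request_id"},
}

# Only state on the symptom's dependency chain is selected for recording.
to_record = backward_slice(DEPS, "symptom_check")
```

In this toy graph `unrelated_counter` is never selected, because the symptom does not depend on it. That exclusion is the intuition behind recording less than a full trace.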

If this is right

  • Developers receive sufficient evidence for root cause identification without manual collection.
  • Runtime overhead remains low during normal operation due to on-demand recording.
  • Bug diagnosis succeeds after only a few occurrences rather than requiring many.
  • Application-level provenances become accessible for distributed system debugging.
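The low-overhead consequence depends on probes being nearly free until a debugging session starts. One way to picture that, as a sketch under our own assumptions rather than a description of Lumos, is a recorder whose probes do nothing unless it has been armed and the value is on the watch list. The class and method names below are hypothetical.

```python
class OnDemandRecorder:
    """Toy recorder: probes are no-ops unless armed and the name is watched."""

    def __init__(self, watched):
        self.watched = frozenset(watched)
        self.armed = False
        self.events = []

    def arm(self):
        # Start a debugging session: probes begin to record.
        self.armed = True

    def disarm(self):
        # End the session: probes go back to doing nothing.
        self.armed = False

    def probe(self, name, value):
        # Cheap early exit when no session is active or the name is irrelevant.
        if self.armed and name in self.watched:
            self.events.append((name, value))


recorder = OnDemandRecorder({"queue_len"})
recorder.probe("queue_len", 3)          # ignored: not armed yet
recorder.arm()
recorder.probe("queue_len", 4)          # recorded
recorder.probe("unrelated_counter", 9)  # ignored: not watched
recorder.disarm()
recorder.probe("queue_len", 5)          # ignored: session ended
```

The idle path here costs one boolean check per probe. Whether a real system can keep its idle path that cheap is exactly what an overhead evaluation would have to show.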

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such provenance tracking could integrate with automated bug reporting systems to suggest fixes.
  • Extending the static analysis to handle more complex dependencies might improve coverage for certain bug types.
  • The low overhead suggests potential for continuous use in large-scale production environments.

Load-bearing premise

That static analysis can accurately identify the program state relevant to a bug's provenance, and that on-demand recording will capture enough evidence from only a few occurrences.

What would settle it

A deployment where Lumos fails to provide enough evidence for root cause identification even after multiple bug occurrences, or where the overhead is high enough to affect system performance noticeably.

Figures

Figures reproduced from arXiv: 2603.29013 by Amit Levy, Gongqi Huang, Jingyuan Chen, Lei Zhang, Leon Schuermann, Ravi Netravali.

Figure 1: Development Process of HDFS-4022
  Excerpt from nearby text: "Our experiments demonstrate that Lumos incurs practical runtime overheads (approx. 10% reduction in throughput at high loads on average) during an active debugging session, and identifies root causes within 5 occurrences."
Figure 2: Lumos Architecture
  Excerpt from nearby text: "… that library functions are correct and therefore only collects application-space information. We argue that this design is reasonable under the observation that most real-world application bugs can be troubleshooted through application-space evidence. However, we stress that Lumos is capable of modeling application-space dependencies induced by library functions."
Figure 4: Examples of drops in analysis precision caused by …
Figure 5: Examples illustrating the analysis of library …
Figure 6: Example of using timestamps to approximate …
Figure 7: HDFS-4022 relevant code fragments
Figure 9: HDFS-5465&5479 relevant codes
Figure 14: Overhead comparison for RR and Lumos. The relative overheads in terms of max throughput and average latencies …
Figure 15: Diagnosis Latency vs Overhead comparison
Original abstract

Debugging distributed systems in-production is inevitable and hard. Myriad interactions between concurrent components in modern, complex and large-scale systems cause non-deterministic bugs that offline testing and verification fail to capture. When bugs surface at runtime, their root causes may be far removed from their symptoms. To identify a root cause, developers often need evidence scattered across multiple components and traces. Unfortunately, existing tools fail to quickly and automatically record useful provenance information at low overheads, leaving developers to manually perform the onerous evidence collection task. Lumos is an online debugging framework that exposes application-level bug provenances--the computational history linking symptoms of an incident to their root causes. Lumos leverages dependency-guided instrumentation powered by static analysis to identify program state related to a bug's provenance, and exposes them via lightweight on-demand recording. Lumos provides developers with enough evidence to identify a bug's root cause, while incurring low runtime overhead, and given only a few occurrences of a bug.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Lumos, an online debugging framework for distributed systems that uses dependency-guided instrumentation powered by static analysis to identify relevant program state and perform lightweight on-demand recording of bug provenance. It claims this provides developers with enough evidence to identify root causes of non-deterministic bugs, at low runtime overhead, after only a few occurrences.

Significance. If the central claims hold, Lumos would address a key practical challenge in production debugging of complex distributed systems by automating the collection of scattered provenance evidence that is currently manual and error-prone. This could meaningfully reduce developer effort for hard-to-reproduce bugs while keeping overhead low enough for in-production use.

major comments (2)
  1. [Abstract] Abstract: The headline claim that Lumos 'provides developers with enough evidence to identify a bug's root cause' with 'low runtime overhead' and 'only a few occurrences of a bug' is unsupported by any evaluation data, measurements, case studies, or implementation details in the manuscript, making it impossible to assess whether the static-analysis approach actually delivers on the guarantee.
  2. [Approach] Approach description (abstract and inferred §3): The reliance on dependency-guided static analysis to surface program state linking symptoms to root causes does not address runtime non-determinism such as dynamic scheduling, message ordering, and concurrency in distributed interactions; this risks incomplete provenance capture even after multiple occurrences, directly undermining the claim of sufficient evidence.
minor comments (1)
  1. [Abstract] Abstract: Consider adding one sentence summarizing the target systems or languages and the evaluation methodology to give readers immediate context for the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where the manuscript's claims and approach description require stronger support and clarification. We address each major comment below and have made targeted revisions to improve the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that Lumos 'provides developers with enough evidence to identify a bug's root cause' with 'low runtime overhead' and 'only a few occurrences of a bug' is unsupported by any evaluation data, measurements, case studies, or implementation details in the manuscript, making it impossible to assess whether the static-analysis approach actually delivers on the guarantee.

    Authors: We agree that the abstract presents strong claims that should be directly tied to supporting evidence. The full manuscript contains evaluation results in Section 5 (including overhead measurements and case studies on distributed systems) and implementation details in Sections 3 and 4. To address the concern, we have revised the abstract to qualify the claims with explicit references to these results and added a concise summary of key evaluation findings. We have also expanded the implementation description in Section 4 to provide more transparency on the static analysis. revision: yes

  2. Referee: [Approach] Approach description (abstract and inferred §3): The reliance on dependency-guided static analysis to surface program state linking symptoms to root causes does not address runtime non-determinism such as dynamic scheduling, message ordering, and concurrency in distributed interactions; this risks incomplete provenance capture even after multiple occurrences, directly undermining the claim of sufficient evidence.

    Authors: We appreciate this point on non-determinism. The dependency-guided static analysis identifies candidate program state based on data and control dependencies, which then guides lightweight on-demand recording of actual runtime executions. Collecting provenance across a few bug occurrences is intended to sample different interleavings and orderings that arise in practice. We acknowledge that this does not provide a formal guarantee of completeness for all possible non-deterministic schedules. We have added a dedicated limitations paragraph in Section 3 discussing runtime non-determinism, the role of multiple occurrences in mitigating it, and the assumptions under which the approach delivers usable evidence, along with pointers to the evaluation results that illustrate this in practice. revision: partial
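The rebuttal's appeal to multiple occurrences can be illustrated with a small sketch. Assume each bug occurrence yields an ordered list of recorded event names; comparing the event orderings across occurrences shows which orderings held every time the bug appeared and which varied between interleavings. This is our illustration of the sampling idea, not a procedure the paper or the rebuttal describes, and the function and event names are hypothetical.

```python
from itertools import combinations


def ordered_pairs(trace):
    """All (earlier, later) pairs of distinct event names in one trace."""
    return {(a, b) for a, b in combinations(trace, 2) if a != b}


def common_orderings(traces):
    """Orderings observed in every failing trace."""
    pair_sets = [ordered_pairs(t) for t in traces]
    return set.intersection(*pair_sets) if pair_sets else set()


# Two hypothetical failing runs with different interleavings.
failing_traces = [
    ["recv", "update", "send"],
    ["update", "recv", "send"],
]
stable = common_orderings(failing_traces)
```

Here the relative order of `recv` and `update` differs between the two runs, so it drops out. Only the orderings that end in `send` survive. With few occurrences the surviving set may still include coincidental orderings, which is the incompleteness the referee raises.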

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a systems framework for provenance-guided debugging in distributed systems, relying on static analysis for dependency-guided instrumentation and on-demand recording. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text or abstract. The central claims about providing sufficient evidence with low overhead rest on the described design choices rather than reducing by construction to inputs, self-citations, or renamed known results. The approach is presented as a practical engineering solution without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract as this is a descriptive systems paper rather than a mathematical derivation.

pith-pipeline@v0.9.0 · 5478 in / 976 out tokens · 33596 ms · 2026-05-14T20:57:12.228349+00:00 · methodology
