Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
Pith reviewed 2026-05-14 20:57 UTC · model grok-4.3
The pith
Lumos automatically captures bug provenance in distributed systems using static analysis to guide lightweight recording.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lumos uses dependency-guided instrumentation, powered by static analysis, to identify the program state related to a bug's provenance, and exposes that state through lightweight on-demand recording. This gives developers enough evidence to identify a bug's root cause at low runtime overhead, after only a few occurrences of the bug.
What carries the argument
Dependency-guided instrumentation powered by static analysis, which selects program state for lightweight on-demand recording of bug provenances.
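To make the idea of dependency-guided selection concrete, here is a minimal sketch that walks a dependency graph backwards from a symptom to find the state worth recording. It is an illustration under stated assumptions, not Lumos's algorithm: the graph, the state names, and the function `backward_slice` are all hypothetical, and a real static analysis would derive the edges from the program rather than from a hand-written table.

```python
from collections import deque


def backward_slice(deps, symptom):
    """Return the symptom plus every state it transitively depends on."""
    seen = {symptom}
    queue = deque([symptom])
    while queue:
        node = queue.popleft()
        # Follow each dependency edge once, breadth first.
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical dependency edges: each key depends on the listed values.
DEPS = {
    "replica_count": {"block_report", "pending_queue"},
    "block_report": {"heartbeat_msg"},
    "pending_queue": {"append_flag"},
    "cache_stats": {"eviction_timer"},  # unrelated to the symptom
}

# State selected for recording when the symptom is a wrong replica count.
to_record = backward_slice(DEPS, "replica_count")
print(sorted(to_record))
```

In this sketch, state that the symptom never depends on (here `cache_stats`) is left out of the recording set, which is the intuition behind keeping instrumentation narrow.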
If this is right
- Developers receive sufficient evidence for root cause identification without manual collection.
- Runtime overhead remains low during normal operation due to on-demand recording.
- Bug diagnosis succeeds after only a few occurrences rather than requiring many.
- Application-level provenances become accessible for distributed system debugging.
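The low-overhead and few-occurrences consequences above both rest on recording being switched on only when needed. The following sketch shows one way such a switch could look; the class `OnDemandRecorder`, its methods, and the state names are hypothetical and are not taken from the paper.

```python
class OnDemandRecorder:
    """Records values only for state names that have been armed."""

    def __init__(self):
        self.active = set()  # state names currently being recorded
        self.log = []        # (occurrence, name, value) tuples

    def arm(self, names):
        # Enable recording for the given state, e.g. after a first symptom.
        self.active |= set(names)

    def record(self, occurrence, name, value):
        # Cheap membership test; unarmed state is dropped immediately.
        if name in self.active:
            self.log.append((occurrence, name, value))


recorder = OnDemandRecorder()
recorder.record(1, "pending_queue", 3)   # ignored: nothing is armed yet
recorder.arm({"pending_queue", "block_report"})
recorder.record(2, "pending_queue", 0)   # kept: armed state
recorder.record(2, "cache_stats", 17)    # ignored: never armed
print(recorder.log)
```

The design choice illustrated here is that the first occurrence of a bug costs almost nothing to observe, and later occurrences supply the evidence once the relevant state has been armed.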
Where Pith is reading between the lines
- Such provenance tracking could integrate with automated bug reporting systems to suggest fixes.
- Extending the static analysis to handle more complex dependencies might improve coverage for certain bug types.
- The low overhead suggests potential for continuous use in large-scale production environments.
Load-bearing premise
That static analysis can accurately identify the program state relevant to a bug's provenance, and that on-demand recording will capture enough evidence from only a few occurrences.
What would settle it
A deployment where Lumos fails to provide enough evidence for root cause identification even after multiple bug occurrences, or where the overhead is high enough to affect system performance noticeably.
Original abstract
Debugging distributed systems in-production is inevitable and hard. Myriad interactions between concurrent components in modern, complex and large-scale systems cause non-deterministic bugs that offline testing and verification fail to capture. When bugs surface at runtime, their root causes may be far removed from their symptoms. To identify a root cause, developers often need evidence scattered across multiple components and traces. Unfortunately, existing tools fail to quickly and automatically record useful provenance information at low overheads, leaving developers to manually perform the onerous evidence collection task. Lumos is an online debugging framework that exposes application-level bug provenances--the computational history linking symptoms of an incident to their root causes. Lumos leverages dependency-guided instrumentation powered by static analysis to identify program state related to a bug's provenance, and exposes them via lightweight on-demand recording. Lumos provides developers with enough evidence to identify a bug's root cause, while incurring low runtime overhead, and given only a few occurrences of a bug.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Lumos, an online debugging framework for distributed systems that uses dependency-guided instrumentation powered by static analysis to identify relevant program state and perform lightweight on-demand recording of bug provenance. It claims this provides developers with enough evidence to identify root causes of non-deterministic bugs, at low runtime overhead, after only a few occurrences.
Significance. If the central claims hold, Lumos would address a key practical challenge in production debugging of complex distributed systems by automating the collection of scattered provenance evidence that is currently manual and error-prone. This could meaningfully reduce developer effort for hard-to-reproduce bugs while keeping overhead low enough for in-production use.
major comments (2)
- [Abstract] Abstract: The headline claim that Lumos 'provides developers with enough evidence to identify a bug's root cause' with 'low runtime overhead' and 'only a few occurrences of a bug' is unsupported by any evaluation data, measurements, case studies, or implementation details in the manuscript, making it impossible to assess whether the static-analysis approach actually delivers on the guarantee.
- [Approach] Approach description (abstract and inferred §3): The reliance on dependency-guided static analysis to surface program state linking symptoms to root causes does not address runtime non-determinism such as dynamic scheduling, message ordering, and concurrency in distributed interactions; this risks incomplete provenance capture even after multiple occurrences, directly undermining the claim of sufficient evidence.
minor comments (1)
- [Abstract] Abstract: Consider adding one sentence summarizing the target systems or languages and the evaluation methodology to give readers immediate context for the claims.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for identifying areas where the manuscript's claims and approach description require stronger support and clarification. We address each major comment below and have made targeted revisions to improve the paper.
- Referee: [Abstract] Abstract: The headline claim that Lumos 'provides developers with enough evidence to identify a bug's root cause' with 'low runtime overhead' and 'only a few occurrences of a bug' is unsupported by any evaluation data, measurements, case studies, or implementation details in the manuscript, making it impossible to assess whether the static-analysis approach actually delivers on the guarantee.
  Authors: We agree that the abstract presents strong claims that should be directly tied to supporting evidence. The full manuscript contains evaluation results in Section 5 (including overhead measurements and case studies on distributed systems) and implementation details in Sections 3 and 4. To address the concern, we have revised the abstract to qualify the claims with explicit references to these results and added a concise summary of key evaluation findings. We have also expanded the implementation description in Section 4 to provide more transparency on the static analysis.
  Revision: yes
- Referee: [Approach] Approach description (abstract and inferred §3): The reliance on dependency-guided static analysis to surface program state linking symptoms to root causes does not address runtime non-determinism such as dynamic scheduling, message ordering, and concurrency in distributed interactions; this risks incomplete provenance capture even after multiple occurrences, directly undermining the claim of sufficient evidence.
  Authors: We appreciate this point on non-determinism. The dependency-guided static analysis identifies candidate program state based on data and control dependencies, which then guides lightweight on-demand recording of actual runtime executions. Collecting provenance across a few bug occurrences is intended to sample different interleavings and orderings that arise in practice. We acknowledge that this does not provide a formal guarantee of completeness for all possible non-deterministic schedules. We have added a dedicated limitations paragraph in Section 3 discussing runtime non-determinism, the role of multiple occurrences in mitigating it, and the assumptions under which the approach delivers usable evidence, along with pointers to the evaluation results that illustrate this in practice.
  Revision: partial
Circularity Check
No circularity in derivation chain
The paper describes a systems framework for provenance-guided debugging in distributed systems, relying on static analysis for dependency-guided instrumentation and on-demand recording. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text or abstract. The central claims about providing sufficient evidence with low overhead rest on the described design choices rather than reducing by construction to inputs, self-citations, or renamed known results. The approach is presented as a practical engineering solution without load-bearing circular steps.