Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos

arxiv: 2603.29013 · v2 · submitted 2026-03-30 · 💻 cs.SE · cs.DC

Jingyuan Chen, Lei Zhang, Leon Schuermann, Gongqi Huang, Ravi Netravali, Amit Levy

Pith reviewed 2026-05-14 20:57 UTC · model grok-4.3

classification 💻 cs.SE cs.DC

keywords online debugging · distributed systems · bug provenance · static analysis · instrumentation · root cause analysis · low overhead

The pith

Lumos automatically captures bug provenance in distributed systems using static analysis to guide lightweight recording.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lumos, a framework for online debugging of distributed systems that focuses on exposing the computational history linking bug symptoms to root causes. It does this by using static analysis to identify relevant program states and then performing on-demand recording only when necessary. This approach aims to give developers sufficient evidence to diagnose issues after only a few bug occurrences while keeping runtime overhead low. A sympathetic reader would care because manual evidence collection in production is time-consuming and error-prone, especially for non-deterministic bugs.

Core claim

Lumos uses dependency-guided instrumentation, powered by static analysis, to identify the program state related to a bug's provenance, and exposes that state via lightweight on-demand recording. This gives developers enough evidence to identify a bug's root cause at low runtime overhead and after only a few occurrences of the bug.

What carries the argument

Dependency-guided instrumentation powered by static analysis, which selects program state for lightweight on-demand recording of bug provenances.
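
One way to picture dependency-guided selection is a backward traversal over a dependency graph, starting from the symptom and collecting everything it transitively depends on. The sketch below is illustrative only: the graph, the node names, and the `backward_slice` helper are assumptions made for this example, and the paper's actual analysis is not reproduced here.

```python
from collections import deque


def backward_slice(deps, symptom):
    """Return every node the symptom transitively depends on (including itself).

    `deps` maps a node to the nodes it directly depends on, standing in for
    data and control dependencies found by static analysis.
    All node names in this example are hypothetical.
    """
    seen = {symptom}
    queue = deque([symptom])
    while queue:
        node = queue.popleft()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical dependency graph for a replication-style symptom.
DEPS = {
    "replica_count": ["pending_queue", "block_state"],
    "pending_queue": ["append_request"],
    "block_state": ["append_request", "heartbeat"],
    "log_level": ["config"],  # unrelated to the symptom, so never selected
}

to_record = backward_slice(DEPS, "replica_count")
```

Only state reachable from the symptom ends up in `to_record`; unrelated state such as `log_level` is left uninstrumented, which is the intuition behind keeping the recorded set small.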

If this is right

Developers receive sufficient evidence for root cause identification without manual collection.
Runtime overhead remains low during normal operation due to on-demand recording.
Bug diagnosis succeeds after only a few occurrences rather than requiring many.
Application-level provenances become accessible for distributed system debugging.
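
The low-overhead claim rests on recording being off unless a debugging session asks for it. A minimal sketch of that idea follows, assuming a hypothetical `OnDemandRecorder` class with a per-name enable set; this is not Lumos's interface, only an illustration of gating recording behind a cheap membership check.

```python
import time


class OnDemandRecorder:
    """Records values only for names enabled by a debugging session.

    Hypothetical illustration: class, method, and variable names are
    assumptions, not part of Lumos.
    """

    def __init__(self):
        self.enabled = set()   # names currently being recorded
        self.events = []       # (timestamp, name, value) tuples

    def enable(self, names):
        self.enabled.update(names)

    def disable_all(self):
        self.enabled.clear()

    def record(self, name, value):
        # Fast path: when the name is not enabled, do nothing.
        if name not in self.enabled:
            return
        self.events.append((time.monotonic(), name, value))


recorder = OnDemandRecorder()
recorder.record("block_state", "APPENDING")   # ignored: nothing enabled yet
recorder.enable({"block_state"})              # debugging session starts
recorder.record("block_state", "COMMITTED")   # captured
recorder.record("log_level", "INFO")          # ignored: not selected
```

During normal operation every call to `record` returns after a single set lookup, so the cost is paid only for state that a session has explicitly selected.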

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such provenance tracking could integrate with automated bug reporting systems to suggest fixes.
Extending the static analysis to handle more complex dependencies might improve coverage for certain bug types.
The low overhead suggests potential for continuous use in large-scale production environments.

Load-bearing premise

That static analysis can accurately identify the program state relevant to a bug's provenance, and that on-demand recording will capture enough evidence from only a few occurrences.

What would settle it

A deployment where Lumos fails to provide enough evidence for root cause identification even after multiple bug occurrences, or where the overhead is high enough to affect system performance noticeably.

Figures

Figures reproduced from arXiv: 2603.29013 by Amit Levy, Gongqi Huang, Jingyuan Chen, Lei Zhang, Leon Schuermann, Ravi Netravali.

**Figure 1.** Development Process of HDFS-4022. Text captured alongside the figure: "Our experiments demonstrate that Lumos incurs practical runtime overheads (approx. 10% reduction in throughput at high loads on average) during an active debugging session, and identifies root causes within 5 occurrences."

**Figure 2.** Lumos Architecture. Text captured alongside the figure: "… that library functions are correct and therefore only collects application-space information. We argue that this design is reasonable under the observation that most real-world application bugs can be troubleshooted through application-space evidence. However, we stress that Lumos is capable of modeling application-space dependencies induced by library functions."

**Figure 4.** Examples of drops in analysis precision caused by …

**Figure 5.** Examples illustrating the analysis of library …

**Figure 6.** Example of using timestamps to approximate …

**Figure 7.** HDFS-4022 relevant code fragments

**Figure 9.** HDFS-5465&5479 relevant codes

**Figure 14.** Overhead comparison for RR and Lumos. The relative overheads in terms of max throughput and average latencies …

**Figure 15.** Diagnosis Latency vs Overhead comparison

Original abstract

Debugging distributed systems in-production is inevitable and hard. Myriad interactions between concurrent components in modern, complex and large-scale systems cause non-deterministic bugs that offline testing and verification fail to capture. When bugs surface at runtime, their root causes may be far removed from their symptoms. To identify a root cause, developers often need evidence scattered across multiple components and traces. Unfortunately, existing tools fail to quickly and automatically record useful provenance information at low overheads, leaving developers to manually perform the onerous evidence collection task. Lumos is an online debugging framework that exposes application-level bug provenances--the computational history linking symptoms of an incident to their root causes. Lumos leverages dependency-guided instrumentation powered by static analysis to identify program state related to a bug's provenance, and exposes them via lightweight on-demand recording. Lumos provides developers with enough evidence to identify a bug's root cause, while incurring low runtime overhead, and given only a few occurrences of a bug.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Lumos pairs static analysis with on-demand recording to surface bug provenance in distributed systems, but the approach may not fully handle runtime non-determinism.

The main point is that Lumos uses dependency-guided static analysis to pick out relevant program state and then records it lightly on demand, so developers get enough evidence to trace a bug back to its root cause after just a few occurrences and without a large runtime cost. That combination is the concrete new piece for online debugging of non-deterministic distributed bugs. The paper does a clear job of spelling out why current tools leave developers collecting evidence manually across components, and keeping overhead low while still supplying usable provenance is a practical target. If the full text shows working instrumentation and some case studies on real systems, that part lands as useful engineering.

The soft spot is the one the stress-test note flags: static analysis can miss dynamic scheduling, message ordering, and other concurrency effects that only appear at runtime. In distributed code those factors often drive the actual causal chain, so the recorded state might still be incomplete even after several bug hits. The abstract stays high-level and gives no numbers or examples, so the evaluation will have to demonstrate that the chosen state is sufficient in practice. No circular math or invented entities show up.

This is for systems people who build or maintain large distributed services and need better debugging hooks. A reader already working on provenance or tracing tools would pick up usable design choices here. I would send it to peer review because the core framework is grounded enough to deserve a full check with the data and examples filled in.

Referee Report

2 major / 1 minor

Summary. The paper presents Lumos, an online debugging framework for distributed systems that uses dependency-guided instrumentation powered by static analysis to identify relevant program state and perform lightweight on-demand recording of bug provenance. It claims this provides developers with enough evidence to identify root causes of non-deterministic bugs, at low runtime overhead, after only a few occurrences.

Significance. If the central claims hold, Lumos would address a key practical challenge in production debugging of complex distributed systems by automating the collection of scattered provenance evidence that is currently manual and error-prone. This could meaningfully reduce developer effort for hard-to-reproduce bugs while keeping overhead low enough for in-production use.

major comments (2)

[Abstract] Abstract: The headline claim that Lumos 'provides developers with enough evidence to identify a bug's root cause' with 'low runtime overhead' and 'only a few occurrences of a bug' is unsupported by any evaluation data, measurements, case studies, or implementation details in the manuscript, making it impossible to assess whether the static-analysis approach actually delivers on the guarantee.
[Approach] Approach description (abstract and inferred §3): The reliance on dependency-guided static analysis to surface program state linking symptoms to root causes does not address runtime non-determinism such as dynamic scheduling, message ordering, and concurrency in distributed interactions; this risks incomplete provenance capture even after multiple occurrences, directly undermining the claim of sufficient evidence.

minor comments (1)

[Abstract] Abstract: Consider adding one sentence summarizing the target systems or languages and the evaluation methodology to give readers immediate context for the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where the manuscript's claims and approach description require stronger support and clarification. We address each major comment below and have made targeted revisions to improve the paper.

Referee: [Abstract] Abstract: The headline claim that Lumos 'provides developers with enough evidence to identify a bug's root cause' with 'low runtime overhead' and 'only a few occurrences of a bug' is unsupported by any evaluation data, measurements, case studies, or implementation details in the manuscript, making it impossible to assess whether the static-analysis approach actually delivers on the guarantee.

Authors: We agree that the abstract presents strong claims that should be directly tied to supporting evidence. The full manuscript contains evaluation results in Section 5 (including overhead measurements and case studies on distributed systems) and implementation details in Sections 3 and 4. To address the concern, we have revised the abstract to qualify the claims with explicit references to these results and added a concise summary of key evaluation findings. We have also expanded the implementation description in Section 4 to provide more transparency on the static analysis.
revision: yes

Referee: [Approach] Approach description (abstract and inferred §3): The reliance on dependency-guided static analysis to surface program state linking symptoms to root causes does not address runtime non-determinism such as dynamic scheduling, message ordering, and concurrency in distributed interactions; this risks incomplete provenance capture even after multiple occurrences, directly undermining the claim of sufficient evidence.

Authors: We appreciate this point on non-determinism. The dependency-guided static analysis identifies candidate program state based on data and control dependencies, which then guides lightweight on-demand recording of actual runtime executions. Collecting provenance across a few bug occurrences is intended to sample different interleavings and orderings that arise in practice. We acknowledge that this does not provide a formal guarantee of completeness for all possible non-deterministic schedules. We have added a dedicated limitations paragraph in Section 3 discussing runtime non-determinism, the role of multiple occurrences in mitigating it, and the assumptions under which the approach delivers usable evidence, along with pointers to the evaluation results that illustrate this in practice.
revision: partial
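
To make the "few occurrences sample different interleavings" argument concrete, the sketch below compares recorded event orders from several hypothetical occurrences and keeps only the orderings that held in every one. This is an illustration of the reasoning, not a technique attributed to the paper; the event names, traces, and both helper functions are assumptions.

```python
from itertools import combinations


def ordered_pairs(trace):
    """All (earlier, later) pairs of distinct events in one recorded trace."""
    return {(a, b) for a, b in combinations(trace, 2) if a != b}


def common_orderings(traces):
    """Orderings observed in every occurrence; empty set if no traces."""
    pair_sets = [ordered_pairs(t) for t in traces]
    return set.intersection(*pair_sets) if pair_sets else set()


# Three hypothetical occurrences of the same symptom, each with a
# different interleaving of the surrounding events.
occurrences = [
    ["append", "heartbeat", "commit", "under_replicated"],
    ["heartbeat", "append", "commit", "under_replicated"],
    ["append", "commit", "heartbeat", "under_replicated"],
]

stable = common_orderings(occurrences)
```

Orderings that vary between occurrences (such as `append` versus `heartbeat`) drop out, while those that survive every interleaving remain as candidates for the causal chain. As the rebuttal concedes, a handful of samples cannot rule out schedules that were never observed.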

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a systems framework for provenance-guided debugging in distributed systems, relying on static analysis for dependency-guided instrumentation and on-demand recording. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text or abstract. The central claims about providing sufficient evidence with low overhead rest on the described design choices rather than reducing by construction to inputs, self-citations, or renamed known results. The approach is presented as a practical engineering solution without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract as this is a descriptive systems paper rather than a mathematical derivation.

pith-pipeline@v0.9.0 · 5478 in / 976 out tokens · 33596 ms · 2026-05-14T20:57:12.228349+00:00 · methodology

