Provenance for Large-scale Datalog
Pith reviewed 2026-05-24 22:59 UTC · model grok-4.3
The pith
A new provenance lattice with proof annotations and adjusted semi-naive semantics lets Datalog programs answer arbitrary debugging queries on large analyses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing a provenance lattice that carries proof annotations alongside ordinary Datalog facts and by redefining the fixed-point semantics of semi-naive evaluation to propagate those annotations, the engine can answer arbitrary provenance queries that construct partial proof trees of minimal height for any output tuple, all while preserving the original program semantics and incurring only modest runtime overhead on large-scale inputs such as DOOP analyses.
What carries the argument
The provenance lattice augmented with proof annotations, together with the modified fixed-point operator for semi-naive evaluation that records derivation information during iteration.
If this is right
- Debugging queries can request provenance for any tuple and receive a minimal-height partial proof tree on demand.
- The same engine can produce provenance for tens of millions of tuples produced by hundreds of recursive rules.
- Runtime cost stays within a factor of roughly 1.27 compared with ordinary evaluation.
- The technique supports more flexible queries than previous state-of-the-art provenance methods for Datalog.
Where Pith is reading between the lines
- The same lattice construction could be applied to other bottom-up logic engines that already use semi-naive evaluation.
- Minimal-height proof trees might serve as concise explanations when an analysis reports an unexpected result.
- Interactive tools could cache the annotated facts and answer successive debugging queries without restarting evaluation.
Load-bearing premise
The new lattice and modified semi-naive fixed-point semantics preserve the original Datalog semantics while correctly recording all proof information needed for later debugging queries.
What would settle it
Execute the instrumented engine on a small, manually verifiable Datalog program whose complete set of minimal derivations is already known, then compare the proof trees returned by debugging queries against the expected derivations and check that no extra or missing tuples appear in the output.
Figures
read the original abstract
Logic programming languages such as Datalog have become popular as Domain Specific Languages (DSLs) for solving large-scale, real-world problems, in particular, static program analysis and network analysis. The logic specifications which model analysis problems, process millions of tuples of data and contain hundreds of highly recursive rules. As a result, they are notoriously difficult to debug. While the database community has proposed several data-provenance techniques that address the Declarative Debugging Challenge for Databases, in the cases of analysis problems, these state-of-the-art techniques do not scale. In this paper, we introduce a novel bottom-up Datalog evaluation strategy for debugging: our provenance evaluation strategy relies on a new provenance lattice that includes proof annotations, and a new fixed-point semantics for semi-naive evaluation. A debugging query mechanism allows arbitrary provenance queries, constructing partial proof trees of tuples with minimal height. We integrate our technique into Souffle, a Datalog engine that synthesizes C++ code, and achieve high performance by using specialized parallel data structures. Experiments are conducted with DOOP/DaCapo, producing proof annotations for tens of millions of output tuples. We show that our method has a runtime overhead of 1.27x on average while being more flexible than existing state-of-the-art techniques.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a novel bottom-up Datalog evaluation strategy for debugging large-scale programs (e.g., static analyses with millions of tuples and hundreds of recursive rules). It relies on a new provenance lattice that includes proof annotations together with a modified fixed-point semantics for semi-naive evaluation; this enables a debugging query mechanism that constructs partial proof trees of minimal height. The approach is integrated into Soufflé via specialized parallel data structures and is evaluated on DOOP/DaCapo, reporting an average runtime overhead of 1.27× while producing annotations for tens of millions of output tuples and offering greater flexibility than prior provenance techniques.
Significance. If the lattice definition and semantic preservation claims hold, the work supplies a practical, scalable provenance mechanism for Datalog DSLs used in program and network analysis. The combination of proof annotations, minimal-height trees, and low-overhead parallel implementation in Soufflé addresses a recognized debugging bottleneck; the reported experiments on real-world benchmarks constitute a concrete strength.
major comments (2)
- [§3 (Semantics)] The central claim that the new provenance lattice and modified semi-naive fixed-point semantics preserve the original Datalog least-fixed-point semantics while correctly recording all necessary proof information is load-bearing; the manuscript must supply the lattice definition, the revised immediate-consequence operator, and a proof of equivalence (or a machine-checked statement) to allow verification that no tuples are added or lost.
- [§5 (Evaluation)] Table 1 (or equivalent experimental summary) reports 1.27× average overhead; the paper should clarify whether this figure includes the cost of materializing the proof annotations for all output tuples or only the base evaluation, because the former is required to support the scalability claim for tens of millions of tuples.
minor comments (2)
- [§2–3] Notation for the provenance lattice elements (e.g., how proof annotations are encoded inside the lattice values) should be introduced once and used consistently; the abstract uses “proof annotations” while later sections appear to switch between “annotations” and “labels.”
- [§4] The description of the debugging query mechanism would benefit from a small worked example showing a concrete provenance query, the resulting partial proof tree, and how minimality of height is enforced.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below.
read point-by-point responses
-
Referee: [§3 (Semantics)] The central claim that the new provenance lattice and modified semi-naive fixed-point semantics preserve the original Datalog least-fixed-point semantics while correctly recording all necessary proof information is load-bearing; the manuscript must supply the lattice definition, the revised immediate-consequence operator, and a proof of equivalence (or a machine-checked statement) to allow verification that no tuples are added or lost.
Authors: We agree that explicit verification of semantic preservation is essential. The revised manuscript will expand §3 with the complete definition of the provenance lattice, the revised immediate-consequence operator used in the modified semi-naive evaluation, and a formal proof (or machine-checkable statement) establishing equivalence to the standard least fixed-point semantics, confirming that no tuples are added or lost. revision: yes
-
Referee: [§5 (Evaluation)] Table 1 (or equivalent experimental summary) reports 1.27× average overhead; the paper should clarify whether this figure includes the cost of materializing the proof annotations for all output tuples or only the base evaluation, because the former is required to support the scalability claim for tens of millions of tuples.
Authors: The 1.27× figure in Table 1 measures the full provenance-enabled evaluation, which includes materialization of proof annotations for all output tuples. We will add an explicit statement to this effect in the evaluation section of the revised manuscript to remove any ambiguity. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper claims a novel provenance lattice with proof annotations and modified fixed-point semantics for semi-naive Datalog evaluation that preserves original semantics while supporting minimal-height proof trees. No equations, lattice definitions, or semantic derivations are supplied in the abstract or visible text that reduce any result to its inputs by construction. No self-citations are invoked as load-bearing premises, no parameters are fitted then renamed as predictions, and no uniqueness theorems or ansatzes are smuggled via prior work. The derivation chain is presented as building on standard least-fixed-point Datalog evaluation with independent new structures, making the central claims self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Fixed-point semantics for semi-naive evaluation of Datalog
invented entities (1)
-
provenance lattice with proof annotations
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery; no lattice-height minimization unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
our provenance evaluation strategy relies on a new provenance lattice that includes proof annotations, and a new fixed-point semantics for semi-naive evaluation... TP((I,h)) = (ΓP(I), h′) where h′(t) = min_g∈Gt {max ti∈g h(ti)} + 1
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Definition 3 (Provenance Instance)... Lemma 1. ⊑ is a partial order... Theorem 1. ... height annotations are minimal
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
souffl´ e synthesizes a native parallel c++ program from a logic specifi cation
souffle-lang/souffle: Souffl´ e is a variant of datalog for tool desig ners crafting analyses in horn clauses. souffl´ e synthesizes a native parallel c++ program from a logic specifi cation. more information can be found here: http://souffle-lang.github.io/, 2017. Accessed: 19-1 0-2017
work page 2017
-
[2]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases . Addison-Wesley Publishing Company, 1995
work page 1995
- [3]
-
[4]
B. S. Arab, D. Gawlick, V. Krishnaswamy, V. Radhakrishnan, and B. Glavic. Using reenactment to retroactively capture provenance for transactions. IEEE Transactions on Knowledge and Data Engi- neering, 30(3):599–612, March 2018
work page 2018
-
[5]
M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pa salic, T. L. Veldhuizen, and G. Wash- burn. Design and implementation of the logicblox system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data , SIGMOD ’15, pages 1371–1382, New York, NY, USA, 2015. ACM
work page 2015
- [6]
-
[7]
O. Benjelloun, A. D. Sarma, C. Hayworth, and J. Widom. An introd uction to uldbs and the trio system. IEEE Data Engineering Bulletin , 2006
work page 2006
-
[8]
M. Bravenboer and Y. Smaragdakis. Strictly declarative specific ation of sophisticated points-to analyses. SIGPLAN Not. , 44(10):243–262, 2009
work page 2009
-
[9]
P. Buneman, S. Khanna, and W.-C. Tan. Why and where: A charac terization of data provenance. Proceedings of the International Conference on Database Th eory, 1973:316–330, 2001
work page 1973
-
[10]
R. Caballero, Y. Garc ´ ıa-Ruiz, and F. S´ aenz-P´ erez. A theoretical framework for the declarative debugging of datalog programs. In K.-D. Schewe and B. Thalheim, editors, Semantics in Data and Knowledge Bases, pages 143–159, Berlin, Heidelberg, 2008. Springer Berlin Heidelber g. 26
work page 2008
-
[11]
R. Caballero, Y. Garca-Ruiz, and F. Senz-Prez. Debugging of w rong and missing answers for datalog programs with constraint handling rules. 07 2015
work page 2015
- [12]
- [13]
- [14]
- [15]
-
[16]
W. Drabent, S. Nadjm-Tehrani, and J. Maluszy´ nski. Meta-pr ogramming in logic programming. pages 501–521, 1989
work page 1989
- [17]
-
[18]
B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. IEEE International Conference on Data Engineering , 25:174–185, 2009
work page 2009
- [19]
- [20]
- [21]
- [22]
-
[23]
S. S. Huang, T. J. Green, and B. T. Loo. Datalog and emerging a pplications: An interactive tutorial. In Proceedings of the 2011 ACM SIGMOD International Conferenc e on Management of Data , SIGMOD ’11, pages 1213–1216. ACM, 2011
work page 2011
- [24]
- [25]
-
[26]
S. K¨ ohler, B. Lud¨ ascher, and Y. Smaragdakis. Declarative datalog debugging for mere mortals. Lecture Notes in Computer Science , 7494:111–122, 2012
work page 2012
-
[27]
S. Lee, S. K¨ ohler, B. Lud¨ ascher, and B. Glavic. Efficiently com puting provenance graphs for queries with negation. CoRR, abs/1701.05699, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[28]
S. Lee, B. Lud¨ ascher, and B. Glavic. Provenance summaries f or answers and non-answers. Proc. VLDB Endow., 11(12):1954–1957, Aug. 2018
work page 1954
- [29]
-
[30]
M. Madsen, M.-H. Yee, and O. Lhot´ ak. From datalog to flix: A de clarative language for fixed points on lattices. In Proceedings of the 37th ACM SIGPLAN Conference on Programmi ng Language Design and Implementation , PLDI ’16, pages 194–208, New York, NY, USA, 2016. ACM
work page 2016
- [31]
-
[32]
X. Ou, S. Govindavajhala, and A. W. Appel. Mulval: A logic-based n etwork security analyzer. In Proceedings of the 14th Conference on USENIX Security Sympo sium - Volume 14 , SSYM’05, pages 8–8, Berkeley, CA, USA, 2005. USENIX Association
work page 2005
-
[33]
M. Raghothaman, S. Kulkarni, K. Heo, and M. Naik. User-guided program reasoning using bayesian inference. pages 722–735, 06 2018
work page 2018
-
[34]
R. Ramakrishnan and S. Sudarshan. Top-down vs. bottom-up revisited. In In Proceedings of the International Logic Programming Symposium , pages 321–336. MIT Press, 1991
work page 1991
-
[35]
E. Y. Shapiro. Algorithmic Program DeBugging . MIT Press, Cambridge, MA, USA, 1983
work page 1983
-
[36]
M. Sridharan, D. Gopan, L. Shan, and R. Bod ´ ık. Demand-drive n points-to analysis for java. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Obj ect-oriented Programming, Systems, Languages, and Applications , OOPSLA ’05, pages 59–76, New York, NY, USA, 2005. ACM
work page 2005
-
[37]
M. Stamatogiannakis, P. Groth, and H. Bos. Decoupling proven ance capture and analysis from exe- cution. In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15) , Edinburgh, Scotland, 2015. USENIX Association
work page 2015
-
[38]
P. Subotic, H. Jordan, L. Chang, A. Fekete, and B. Scholz. Au tomatic index selection for large-scale datalog computation. PVLDB, 12(2):141–153, 2018
work page 2018
-
[39]
J. D. Ullman. Bottom-up beats top-down for datalog. In Proceedings of the Eighth ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems , PODS ’89, pages 140–149, New York, NY, USA, 1989. ACM
work page 1989
- [40]
-
[41]
J. Widom. Trio: A system for integrated management of data, a ccuracy, and lineage. In 2nd Biennial Conference on Innovative Data Systems Research, CIDR 2005 , pages 262–276, 01 2005
work page 2005
- [42]
- [43]
-
[44]
D. Zhao. Honours thesis: Large-scale provenance for souffl´ e. 2017
work page 2017
-
[45]
W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient querying and maintenance of network provenance at internet-scale. Proceedings of the 2010 ACM SIGMOD International Conferenc e on Management of Data , pages 615–626, 2010. 28
work page 2010
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.