pith. sign in

arxiv: 1906.10261 · v1 · pith:YMIMPAOPnew · submitted 2019-06-24 · 💻 cs.DB · cs.DC· cs.LO

Datalog Materialisation in Distributed RDF Stores with Dynamic Data Exchange

Pith reviewed 2026-05-25 16:26 UTC · model grok-4.3

classification 💻 cs.DB cs.DCcs.LO
keywords Datalog materialisationDistributed RDF storesSeminaive evaluationDynamic data exchangeReasoning over RDFArbitrary datalog rules
0
0 comments X

The pith

Dynamic data exchange extends to arbitrary datalog materialisation in distributed RDF stores while avoiding repeated inferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts an existing technique for answering queries across many servers to the task of precomputing all facts implied by datalog rules over RDF data. This produces a distributed version of the seminaive materialisation algorithm that works for any rules rather than restricted fragments. The extension keeps the guarantee that each inference is derived only once. Experiments show the resulting system processes datasets too large for a single machine.

Core claim

The dynamic data exchange approach for distributed query answering can be extended to a reasoning algorithm that handles arbitrary datalog rules while preserving non-repetition of inferences.

What carries the argument

Dynamic data exchange mechanism, which coordinates the movement of newly derived facts between servers during evaluation.

If this is right

  • Distributed RDF stores can now support full datalog reasoning instead of being limited to queries or simple rules.
  • Non-repetition of inferences remains guaranteed in the distributed setting.
  • Materialisation becomes feasible for RDF datasets that exceed single-machine memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coordination layer might support hybrid workloads that mix materialisation and on-demand queries.
  • The approach could reduce the need for rule restrictions in distributed knowledge graphs.
  • Further scaling tests on clusters with thousands of nodes would clarify practical limits.

Load-bearing premise

The coordination rules used for queries can be lifted to arbitrary datalog rules without creating extra messages that cause the same inference to be recomputed on multiple servers.

What would settle it

An implementation that produces duplicate inferences on any test rule set or that fails to finish on an RDF dataset larger than centralised systems can handle would falsify the central claim.

read the original abstract

Several centralised RDF systems support datalog reasoning by precomputing and storing all logically implied triples using the wellknown seminaive algorithm. Large RDF datasets often exceed the capacity of centralised RDF systems, and a common solution is to distribute the datasets in a cluster of shared-nothing servers. While numerous distributed query answering techniques are known, distributed seminaive evaluation of arbitrary datalog rules is less understood. In fact, most distributed RDF stores either support no reasoning or can handle only limited datalog fragments. In this paper we extend the dynamic data exchange approach for distributed query answering by Potter et al. [12] to a reasoning algorithm that can handle arbitrary rules while preserving important properties such as nonrepetition of inferences. We also show empirically that our algorithm scales well to very large RDF datasets

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper extends the dynamic data exchange approach for distributed query answering by Potter et al. [12] to support Datalog materialisation over arbitrary rules in distributed RDF stores. It claims that the extension preserves non-repetition of inferences and provides empirical evidence that the resulting algorithm scales to very large RDF datasets.

Significance. If the extension correctly preserves non-repetition without introducing prohibitive coordination overhead, the work would address a notable gap: most distributed RDF stores support either no reasoning or only limited Datalog fragments. The empirical scaling results on large datasets would strengthen the practical relevance of the contribution.

major comments (1)
  1. [Section 4 (algorithm and correctness)] The central claim rests on extending the base dynamic data exchange technique to arbitrary (including recursive) rules while preserving non-repetition. The manuscript should provide an explicit argument or proof sketch for this preservation property under the distributed setting; without it the extension's correctness remains difficult to verify from the high-level description.
minor comments (3)
  1. [Abstract] 'wellknown' should be hyphenated as 'well-known'.
  2. [Abstract] 'nonrepetition' should be written as 'non-repetition' for consistency with standard terminology.
  3. [Section 3] The description of how the dynamic exchange mechanism is adapted for rule bodies containing multiple atoms would benefit from a small worked example to clarify message patterns.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The suggestion to strengthen the correctness argument is well taken, and we will incorporate an explicit proof sketch as described below.

read point-by-point responses
  1. Referee: [Section 4 (algorithm and correctness)] The central claim rests on extending the base dynamic data exchange technique to arbitrary (including recursive) rules while preserving non-repetition. The manuscript should provide an explicit argument or proof sketch for this preservation property under the distributed setting; without it the extension's correctness remains difficult to verify from the high-level description.

    Authors: We agree that an explicit argument is needed for clarity. The manuscript in Section 4 describes the extension of dynamic data exchange to seminaive materialization and asserts preservation of non-repetition by inheriting the base technique's invariants while handling recursion via stratified evaluation and dynamic tuple exchange. However, the high-level presentation does not include a self-contained proof sketch. In the revision we will add a concise proof sketch (approximately one page) in Section 4 that shows: (i) each inferred triple is derived exactly once by the local seminaive evaluator, (ii) the dynamic exchange protocol ensures that a fact is sent to a peer only when it is newly derived and not previously exchanged, and (iii) recursion does not violate this because the stratification and delta-based propagation maintain the same non-repetition guarantee as the centralized case. This will make the correctness argument verifiable without requiring the reader to reconstruct it from the base paper [12]. revision: yes

Circularity Check

0 steps flagged

No significant circularity; extension relies on external prior work

full rationale

The paper's central contribution is an algorithmic extension of the dynamic data exchange mechanism from Potter et al. [12] (external authors) to arbitrary Datalog rules for materialisation, preserving non-repetition of inferences. No self-citation load-bearing, no self-definitional steps, no fitted inputs renamed as predictions, and no ansatz or uniqueness claims imported from the authors' own prior work. The derivation chain is self-contained against the cited external technique, with empirical scaling presented as separate validation. This matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or invented entities; relies on standard datalog and the cited prior technique.

axioms (2)
  • standard math Standard seminaive evaluation for datalog materialization is correct and applicable.
    Paper references the well-known seminaive algorithm as the basis for centralised systems.
  • domain assumption Dynamic data exchange properties established in Potter et al. [12] transfer to the reasoning setting.
    The extension is built directly on this prior approach for query answering.

pith-pipeline@v0.9.0 · 5669 in / 1022 out tokens · 41679 ms · 2026-05-25T16:26:12.438120+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

  1. [1]

    PVLDB 10(13), 2049–2060 (2017)

    Abdelaziz, I., Harbi, R., Khayyat, Z., Kalnis, P.: A Survey and Exper imental Comparison of Distributed SPARQL Engines for Very Large RDF Data . PVLDB 10(13), 2049–2060 (2017)

  2. [2]

    Addiso n- Wesley (1995)

    Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addiso n- Wesley (1995)

  3. [3]

    Dijkstra, E., Feijen, W., van Gasteren, A.: Derivation of a Termina tion De- tection Algorithm for Distributed Computations. Inf. Process. Le tt. 16(5), 217–219 (1983)

  4. [4]

    Journal of Logic Programming 14(1–2), 101–126 (1992)

    Ganguly, S., Silberschatz, A., Tsur, S.: Parallel Bottom-Up Proce ssing of Datalog Queries. Journal of Logic Programming 14(1–2), 101–126 (1992)

  5. [5]

    In: IPDPS

    Gu, R., Wang, S., Wang, F., Yuan, C., Huang, Y.: Cichlid: Efficient Larg e Scale RDFS/OWL Reasoning with Spark. In: IPDPS. pp. 700–709 (20 15)

  6. [6]

    In: ISWC

    Kaoudi, Z., Miliaraki, I., Koubarakis, M.: RDFS Reasoning and Query A n- swering on Top of DHTs. In: ISWC. pp. 499–516 (2008)

  7. [7]

    In: ISWC

    Kolovski, V., Wu, Z., Eadon, G.: Optimizing Enterprise-Scale OWL 2 RL Reasoning in a Relational Database System. In: ISWC. pp. 436–452 (2010)

  8. [8]

    CACM 21(7), 558–565 (1978)

    Lamport, L.: Time, Clocks, and the Ordering of Events in a Distribu ted System. CACM 21(7), 558–565 (1978)

  9. [9]

    In: BeyondMR@SIGMOD 2017

    Liu, Y., McBrien, P.: SPOWL: Spark-based OWL 2 Reasoning Materia lisa- tion. In: BeyondMR@SIGMOD 2017. pp. 3:1–3:10 (2017)

  10. [10]

    In: AAAI

    Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel M ateri- alisation of Datalog Programs in Centralised, Main-Memory RDF Syste ms. In: AAAI. pp. 129–137 (2014)

  11. [11]

    JWS 7(4), 305–316 (2009)

    Oren, E., Kotoulas, S., Anadiotis, G., Siebes, R., ten Teije, A., van Harme- len, F.: Marvin: Distributed reasoning over large-scale Semantic Web data. JWS 7(4), 305–316 (2009)

  12. [12]

    IEEE TKDE 30(12), 2312–2325 (2018)

    Potter, A., Motik, B., Nenov, Y., Horrocks, I.: Dynamic Data Exc hange in Distributed RDF Stores. IEEE TKDE 30(12), 2312–2325 (2018)

  13. [13]

    In: PODS

    Seib, J., Lausen, G.: Parallelizing Datalog Programs by Generalized Pivot- ing. In: PODS. pp. 241–251 (1991)

  14. [14]

    PVLDB 6(14), 1906–19 17 (2013)

    Seo, J., Park, J., Shin, J., Lam, M.: Distributed SociaLite: A Datalo g-Based Language for Large-Scale Graph Analysis. PVLDB 6(14), 1906–19 17 (2013)

  15. [15]

    In: PDIS

    Shao, J., Bell, D., Hull, E.: Combining Rule Decomposition and Data Par - titioning in Parallel Datalog Processing. In: PDIS. pp. 106–115 (199 1)

  16. [16]

    JWS 10 (20 12)

    Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: We bPIE: A Web-scale Parallel Inference Engine using MapReduce. JWS 10 (20 12)

  17. [17]

    In: AAAI

    Urbani, J., Jacobs, C., Kr¨ otzsch, M.: Column-Oriented Datalog Material- ization for Large Knowledge Graphs. In: AAAI. pp. 258–264 (2016 )

  18. [18]

    In: ISWC

    Weaver, J., Hendler, J.A.: Parallel Materialization of the Finite RDF S Clo- sure for Hundreds of Millions of Triples. In: ISWC. pp. 682–697 (200 9)

  19. [19]

    IEEE TKDE 5(3), 523–530 (1993)

    Wolfson, O., Ozeri, A.: Parallel and Distributed Processing of Rule s by Data-Reduction. IEEE TKDE 5(3), 523–530 (1993)

  20. [20]

    In: WISE

    Wu, H., Liu, J., Wang, T., Ye, D., Wei, J., Zhong, H.: Parallel Materializ a- tion of Datalog Programs with Spark. In: WISE. pp. 363–379 (2016 )

  21. [21]

    IEEE TKDE 7(1), 163–176 (1995) 18 T

    Zhang, W., Wang, K., Chau, S.C.: Data Partition and Parallel Evalua tion of Datalog Programs. IEEE TKDE 7(1), 163–176 (1995) 18 T. Ajileye et al. A Proofs Lemma 1. In each run of the algorithm, for each server k, and all facts f1 and f2, we have T (f1) < T (f2) whenever one of the following holds: – PARk(f1, i ) ⇝ addk(f2) for some i, – processk(f1) ⇝ FCT...