Datalog Materialisation in Distributed RDF Stores with Dynamic Data Exchange

Boris Motik; Ian Horrocks; Temitope Ajileye

arxiv: 1906.10261 · v1 · pith:YMIMPAOPnew · submitted 2019-06-24 · 💻 cs.DB · cs.DC· cs.LO

Datalog Materialisation in Distributed RDF Stores with Dynamic Data Exchange

Temitope Ajileye , Boris Motik , Ian Horrocks This is my paper

Pith reviewed 2026-05-25 16:26 UTC · model grok-4.3

classification 💻 cs.DB cs.DCcs.LO

keywords Datalog materialisationDistributed RDF storesSeminaive evaluationDynamic data exchangeReasoning over RDFArbitrary datalog rules

0 comments

The pith

Dynamic data exchange extends to arbitrary datalog materialisation in distributed RDF stores while avoiding repeated inferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts an existing technique for answering queries across many servers to the task of precomputing all facts implied by datalog rules over RDF data. This produces a distributed version of the seminaive materialisation algorithm that works for any rules rather than restricted fragments. The extension keeps the guarantee that each inference is derived only once. Experiments show the resulting system processes datasets too large for a single machine.

Core claim

The dynamic data exchange approach for distributed query answering can be extended to a reasoning algorithm that handles arbitrary datalog rules while preserving non-repetition of inferences.

What carries the argument

Dynamic data exchange mechanism, which coordinates the movement of newly derived facts between servers during evaluation.

If this is right

Distributed RDF stores can now support full datalog reasoning instead of being limited to queries or simple rules.
Non-repetition of inferences remains guaranteed in the distributed setting.
Materialisation becomes feasible for RDF datasets that exceed single-machine memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coordination layer might support hybrid workloads that mix materialisation and on-demand queries.
The approach could reduce the need for rule restrictions in distributed knowledge graphs.
Further scaling tests on clusters with thousands of nodes would clarify practical limits.

Load-bearing premise

The coordination rules used for queries can be lifted to arbitrary datalog rules without creating extra messages that cause the same inference to be recomputed on multiple servers.

What would settle it

An implementation that produces duplicate inferences on any test rule set or that fails to finish on an RDF dataset larger than centralised systems can handle would falsify the central claim.

read the original abstract

Several centralised RDF systems support datalog reasoning by precomputing and storing all logically implied triples using the wellknown seminaive algorithm. Large RDF datasets often exceed the capacity of centralised RDF systems, and a common solution is to distribute the datasets in a cluster of shared-nothing servers. While numerous distributed query answering techniques are known, distributed seminaive evaluation of arbitrary datalog rules is less understood. In fact, most distributed RDF stores either support no reasoning or can handle only limited datalog fragments. In this paper we extend the dynamic data exchange approach for distributed query answering by Potter et al. [12] to a reasoning algorithm that can handle arbitrary rules while preserving important properties such as nonrepetition of inferences. We also show empirically that our algorithm scales well to very large RDF datasets

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends dynamic data exchange from queries to full arbitrary Datalog materialization in distributed RDF stores while preserving non-repetition, and reports empirical scaling on large data.

read the letter

The core contribution is taking Potter et al.'s dynamic data exchange method, which was for distributed query answering, and adapting it to seminaive evaluation of arbitrary Datalog rules over partitioned RDF data. The claim is that this keeps the non-repetition guarantee and avoids extra coordination costs that would kill performance. They also run experiments showing the approach handles very large datasets without blowing up.

Referee Report

1 major / 3 minor

Summary. The paper extends the dynamic data exchange approach for distributed query answering by Potter et al. [12] to support Datalog materialisation over arbitrary rules in distributed RDF stores. It claims that the extension preserves non-repetition of inferences and provides empirical evidence that the resulting algorithm scales to very large RDF datasets.

Significance. If the extension correctly preserves non-repetition without introducing prohibitive coordination overhead, the work would address a notable gap: most distributed RDF stores support either no reasoning or only limited Datalog fragments. The empirical scaling results on large datasets would strengthen the practical relevance of the contribution.

major comments (1)

[Section 4 (algorithm and correctness)] The central claim rests on extending the base dynamic data exchange technique to arbitrary (including recursive) rules while preserving non-repetition. The manuscript should provide an explicit argument or proof sketch for this preservation property under the distributed setting; without it the extension's correctness remains difficult to verify from the high-level description.

minor comments (3)

[Abstract] 'wellknown' should be hyphenated as 'well-known'.
[Abstract] 'nonrepetition' should be written as 'non-repetition' for consistency with standard terminology.
[Section 3] The description of how the dynamic exchange mechanism is adapted for rule bodies containing multiple atoms would benefit from a small worked example to clarify message patterns.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The suggestion to strengthen the correctness argument is well taken, and we will incorporate an explicit proof sketch as described below.

read point-by-point responses

Referee: [Section 4 (algorithm and correctness)] The central claim rests on extending the base dynamic data exchange technique to arbitrary (including recursive) rules while preserving non-repetition. The manuscript should provide an explicit argument or proof sketch for this preservation property under the distributed setting; without it the extension's correctness remains difficult to verify from the high-level description.

Authors: We agree that an explicit argument is needed for clarity. The manuscript in Section 4 describes the extension of dynamic data exchange to seminaive materialization and asserts preservation of non-repetition by inheriting the base technique's invariants while handling recursion via stratified evaluation and dynamic tuple exchange. However, the high-level presentation does not include a self-contained proof sketch. In the revision we will add a concise proof sketch (approximately one page) in Section 4 that shows: (i) each inferred triple is derived exactly once by the local seminaive evaluator, (ii) the dynamic exchange protocol ensures that a fact is sent to a peer only when it is newly derived and not previously exchanged, and (iii) recursion does not violate this because the stratification and delta-based propagation maintain the same non-repetition guarantee as the centralized case. This will make the correctness argument verifiable without requiring the reader to reconstruct it from the base paper [12]. revision: yes

Circularity Check

0 steps flagged

No significant circularity; extension relies on external prior work

full rationale

The paper's central contribution is an algorithmic extension of the dynamic data exchange mechanism from Potter et al. [12] (external authors) to arbitrary Datalog rules for materialisation, preserving non-repetition of inferences. No self-citation load-bearing, no self-definitional steps, no fitted inputs renamed as predictions, and no ansatz or uniqueness claims imported from the authors' own prior work. The derivation chain is self-contained against the cited external technique, with empirical scaling presented as separate validation. This matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or invented entities; relies on standard datalog and the cited prior technique.

axioms (2)

standard math Standard seminaive evaluation for datalog materialization is correct and applicable.
Paper references the well-known seminaive algorithm as the basis for centralised systems.
domain assumption Dynamic data exchange properties established in Potter et al. [12] transfer to the reasoning setting.
The extension is built directly on this prior approach for query answering.

pith-pipeline@v0.9.0 · 5669 in / 1022 out tokens · 41679 ms · 2026-05-25T16:26:12.438120+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we extend the dynamic data exchange approach for distributed query answering by Potter et al. [12] to a reasoning algorithm that can handle arbitrary rules while preserving important properties such as nonrepetition of inferences
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we use Lamport timestamps [8], which provide a cheap way of determining a partial order of events across a cluster

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

PVLDB 10(13), 2049–2060 (2017)

Abdelaziz, I., Harbi, R., Khayyat, Z., Kalnis, P.: A Survey and Exper imental Comparison of Distributed SPARQL Engines for Very Large RDF Data . PVLDB 10(13), 2049–2060 (2017)

work page 2049
[2]

Addiso n- Wesley (1995)

Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addiso n- Wesley (1995)

work page 1995
[3]

Dijkstra, E., Feijen, W., van Gasteren, A.: Derivation of a Termina tion De- tection Algorithm for Distributed Computations. Inf. Process. Le tt. 16(5), 217–219 (1983)

work page 1983
[4]

Journal of Logic Programming 14(1–2), 101–126 (1992)

Ganguly, S., Silberschatz, A., Tsur, S.: Parallel Bottom-Up Proce ssing of Datalog Queries. Journal of Logic Programming 14(1–2), 101–126 (1992)

work page 1992
[5]

In: IPDPS

Gu, R., Wang, S., Wang, F., Yuan, C., Huang, Y.: Cichlid: Eﬃcient Larg e Scale RDFS/OWL Reasoning with Spark. In: IPDPS. pp. 700–709 (20 15)

work page
[6]

In: ISWC

Kaoudi, Z., Miliaraki, I., Koubarakis, M.: RDFS Reasoning and Query A n- swering on Top of DHTs. In: ISWC. pp. 499–516 (2008)

work page 2008
[7]

In: ISWC

Kolovski, V., Wu, Z., Eadon, G.: Optimizing Enterprise-Scale OWL 2 RL Reasoning in a Relational Database System. In: ISWC. pp. 436–452 (2010)

work page 2010
[8]

CACM 21(7), 558–565 (1978)

Lamport, L.: Time, Clocks, and the Ordering of Events in a Distribu ted System. CACM 21(7), 558–565 (1978)

work page 1978
[9]

In: BeyondMR@SIGMOD 2017

Liu, Y., McBrien, P.: SPOWL: Spark-based OWL 2 Reasoning Materia lisa- tion. In: BeyondMR@SIGMOD 2017. pp. 3:1–3:10 (2017)

work page 2017
[10]

In: AAAI

Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel M ateri- alisation of Datalog Programs in Centralised, Main-Memory RDF Syste ms. In: AAAI. pp. 129–137 (2014)

work page 2014
[11]

JWS 7(4), 305–316 (2009)

Oren, E., Kotoulas, S., Anadiotis, G., Siebes, R., ten Teije, A., van Harme- len, F.: Marvin: Distributed reasoning over large-scale Semantic Web data. JWS 7(4), 305–316 (2009)

work page 2009
[12]

IEEE TKDE 30(12), 2312–2325 (2018)

Potter, A., Motik, B., Nenov, Y., Horrocks, I.: Dynamic Data Exc hange in Distributed RDF Stores. IEEE TKDE 30(12), 2312–2325 (2018)

work page 2018
[13]

In: PODS

Seib, J., Lausen, G.: Parallelizing Datalog Programs by Generalized Pivot- ing. In: PODS. pp. 241–251 (1991)

work page 1991
[14]

PVLDB 6(14), 1906–19 17 (2013)

Seo, J., Park, J., Shin, J., Lam, M.: Distributed SociaLite: A Datalo g-Based Language for Large-Scale Graph Analysis. PVLDB 6(14), 1906–19 17 (2013)

work page 1906
[15]

In: PDIS

Shao, J., Bell, D., Hull, E.: Combining Rule Decomposition and Data Par - titioning in Parallel Datalog Processing. In: PDIS. pp. 106–115 (199 1)

work page
[16]

JWS 10 (20 12)

Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: We bPIE: A Web-scale Parallel Inference Engine using MapReduce. JWS 10 (20 12)

work page
[17]

In: AAAI

Urbani, J., Jacobs, C., Kr¨ otzsch, M.: Column-Oriented Datalog Material- ization for Large Knowledge Graphs. In: AAAI. pp. 258–264 (2016 )

work page 2016
[18]

In: ISWC

Weaver, J., Hendler, J.A.: Parallel Materialization of the Finite RDF S Clo- sure for Hundreds of Millions of Triples. In: ISWC. pp. 682–697 (200 9)

work page
[19]

IEEE TKDE 5(3), 523–530 (1993)

Wolfson, O., Ozeri, A.: Parallel and Distributed Processing of Rule s by Data-Reduction. IEEE TKDE 5(3), 523–530 (1993)

work page 1993
[20]

In: WISE

Wu, H., Liu, J., Wang, T., Ye, D., Wei, J., Zhong, H.: Parallel Materializ a- tion of Datalog Programs with Spark. In: WISE. pp. 363–379 (2016 )

work page 2016
[21]

IEEE TKDE 7(1), 163–176 (1995) 18 T

Zhang, W., Wang, K., Chau, S.C.: Data Partition and Parallel Evalua tion of Datalog Programs. IEEE TKDE 7(1), 163–176 (1995) 18 T. Ajileye et al. A Proofs Lemma 1. In each run of the algorithm, for each server k, and all facts f1 and f2, we have T (f1) < T (f2) whenever one of the following holds: – PARk(f1, i ) ⇝ addk(f2) for some i, – processk(f1) ⇝ FCT...

work page 1995

[1] [1]

PVLDB 10(13), 2049–2060 (2017)

Abdelaziz, I., Harbi, R., Khayyat, Z., Kalnis, P.: A Survey and Exper imental Comparison of Distributed SPARQL Engines for Very Large RDF Data . PVLDB 10(13), 2049–2060 (2017)

work page 2049

[2] [2]

Addiso n- Wesley (1995)

Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addiso n- Wesley (1995)

work page 1995

[3] [3]

Dijkstra, E., Feijen, W., van Gasteren, A.: Derivation of a Termina tion De- tection Algorithm for Distributed Computations. Inf. Process. Le tt. 16(5), 217–219 (1983)

work page 1983

[4] [4]

Journal of Logic Programming 14(1–2), 101–126 (1992)

Ganguly, S., Silberschatz, A., Tsur, S.: Parallel Bottom-Up Proce ssing of Datalog Queries. Journal of Logic Programming 14(1–2), 101–126 (1992)

work page 1992

[5] [5]

In: IPDPS

Gu, R., Wang, S., Wang, F., Yuan, C., Huang, Y.: Cichlid: Eﬃcient Larg e Scale RDFS/OWL Reasoning with Spark. In: IPDPS. pp. 700–709 (20 15)

work page

[6] [6]

In: ISWC

Kaoudi, Z., Miliaraki, I., Koubarakis, M.: RDFS Reasoning and Query A n- swering on Top of DHTs. In: ISWC. pp. 499–516 (2008)

work page 2008

[7] [7]

In: ISWC

Kolovski, V., Wu, Z., Eadon, G.: Optimizing Enterprise-Scale OWL 2 RL Reasoning in a Relational Database System. In: ISWC. pp. 436–452 (2010)

work page 2010

[8] [8]

CACM 21(7), 558–565 (1978)

Lamport, L.: Time, Clocks, and the Ordering of Events in a Distribu ted System. CACM 21(7), 558–565 (1978)

work page 1978

[9] [9]

In: BeyondMR@SIGMOD 2017

Liu, Y., McBrien, P.: SPOWL: Spark-based OWL 2 Reasoning Materia lisa- tion. In: BeyondMR@SIGMOD 2017. pp. 3:1–3:10 (2017)

work page 2017

[10] [10]

In: AAAI

Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel M ateri- alisation of Datalog Programs in Centralised, Main-Memory RDF Syste ms. In: AAAI. pp. 129–137 (2014)

work page 2014

[11] [11]

JWS 7(4), 305–316 (2009)

Oren, E., Kotoulas, S., Anadiotis, G., Siebes, R., ten Teije, A., van Harme- len, F.: Marvin: Distributed reasoning over large-scale Semantic Web data. JWS 7(4), 305–316 (2009)

work page 2009

[12] [12]

IEEE TKDE 30(12), 2312–2325 (2018)

Potter, A., Motik, B., Nenov, Y., Horrocks, I.: Dynamic Data Exc hange in Distributed RDF Stores. IEEE TKDE 30(12), 2312–2325 (2018)

work page 2018

[13] [13]

In: PODS

Seib, J., Lausen, G.: Parallelizing Datalog Programs by Generalized Pivot- ing. In: PODS. pp. 241–251 (1991)

work page 1991

[14] [14]

PVLDB 6(14), 1906–19 17 (2013)

Seo, J., Park, J., Shin, J., Lam, M.: Distributed SociaLite: A Datalo g-Based Language for Large-Scale Graph Analysis. PVLDB 6(14), 1906–19 17 (2013)

work page 1906

[15] [15]

In: PDIS

Shao, J., Bell, D., Hull, E.: Combining Rule Decomposition and Data Par - titioning in Parallel Datalog Processing. In: PDIS. pp. 106–115 (199 1)

work page

[16] [16]

JWS 10 (20 12)

Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: We bPIE: A Web-scale Parallel Inference Engine using MapReduce. JWS 10 (20 12)

work page

[17] [17]

In: AAAI

Urbani, J., Jacobs, C., Kr¨ otzsch, M.: Column-Oriented Datalog Material- ization for Large Knowledge Graphs. In: AAAI. pp. 258–264 (2016 )

work page 2016

[18] [18]

In: ISWC

Weaver, J., Hendler, J.A.: Parallel Materialization of the Finite RDF S Clo- sure for Hundreds of Millions of Triples. In: ISWC. pp. 682–697 (200 9)

work page

[19] [19]

IEEE TKDE 5(3), 523–530 (1993)

Wolfson, O., Ozeri, A.: Parallel and Distributed Processing of Rule s by Data-Reduction. IEEE TKDE 5(3), 523–530 (1993)

work page 1993

[20] [20]

In: WISE

Wu, H., Liu, J., Wang, T., Ye, D., Wei, J., Zhong, H.: Parallel Materializ a- tion of Datalog Programs with Spark. In: WISE. pp. 363–379 (2016 )

work page 2016

[21] [21]

IEEE TKDE 7(1), 163–176 (1995) 18 T

Zhang, W., Wang, K., Chau, S.C.: Data Partition and Parallel Evalua tion of Datalog Programs. IEEE TKDE 7(1), 163–176 (1995) 18 T. Ajileye et al. A Proofs Lemma 1. In each run of the algorithm, for each server k, and all facts f1 and f2, we have T (f1) < T (f2) whenever one of the following holds: – PARk(f1, i ) ⇝ addk(f2) for some i, – processk(f1) ⇝ FCT...

work page 1995