A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs

Daniel Hienert; Maribel Acosta; Matthäus Zloch; Stefan Conrad; Stefan Dietze

arXiv: 1907.01885 · v1 · pith:7B43UVKVnew · submitted 2019-07-03 · 💻 cs.DB

Pith reviewed 2026-05-25 09:38 UTC · model grok-4.3

classification 💻 cs.DB

keywords RDF graphs · graph measures · LOD Cloud · Semantic Web · graph topology · software framework · dataset analysis

The pith

A framework and analysis of 280 RDF datasets identify measures that characterize Semantic Web graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a software framework for acquiring and analyzing the topology of large RDF graphs using various graph measures. It applies the framework to 280 datasets from the Linked Open Data Cloud, calculating 28 different measures for each. A preliminary analysis of these results points to a smaller subset of measures that can effectively characterize the structure of graphs in the Semantic Web. Such characterization would support the creation of better synthetic dataset generators, sampling methods, and query optimizers for RDF data.

Core claim

We propose a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and we provide results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework. We present a preliminary analysis based on the proposed resources and point out implications for synthetic dataset generators. Finally, we identify a set of measures that can be used to characterize graphs in the Semantic Web.

What carries the argument

The software framework that acquires, prepares, and computes 28 graph measures on RDF graphs, applied across 280 LOD Cloud datasets to support identification of a characterizing subset.
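To make the acquire, prepare, and compute steps concrete, here is a minimal sketch. It is not the authors' framework or its API. It assumes the triples are already parsed into (subject, predicate, object) tuples and that each triple becomes one directed edge from subject to object. The names `build_graph` and `basic_measures`, the example triples, and the choice of measures are all hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Map each (subject, predicate, object) triple to a directed edge s -> o."""
    out_edges = defaultdict(list)
    vertices = set()
    for s, _p, o in triples:
        out_edges[s].append(o)
        vertices.update((s, o))
    return vertices, out_edges


def basic_measures(vertices, out_edges):
    """Compute a handful of simple measures (an illustrative subset only)."""
    n = len(vertices)
    m = sum(len(targets) for targets in out_edges.values())
    in_degree = defaultdict(int)
    for targets in out_edges.values():
        for o in targets:
            in_degree[o] += 1
    return {
        "n": n,                                   # number of vertices
        "m": m,                                   # number of edges
        "avg_degree": m / n if n else 0.0,        # assumed here: edges per vertex
        "max_out_degree": max((len(t) for t in out_edges.values()), default=0),
        "max_in_degree": max(in_degree.values(), default=0),
    }


# Made-up triples, for illustration only.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:carol", "rdf:type", "foaf:Person"),
]
vertices, out_edges = build_graph(triples)
measures = basic_measures(vertices, out_edges)
```

A real pipeline would also need download, format conversion, and a graph library that scales to large inputs; none of that is modeled here.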

Load-bearing premise

The 280 selected LOD Cloud datasets are sufficiently representative of RDF graphs that the observed measure distributions and the identified characterizing subset will generalize to other RDF collections.

What would settle it

Applying the same framework and analysis process to a different collection of RDF graphs and finding that a different subset of measures is required to characterize them.

Figures

Figures reproduced from arXiv: 1907.01885 by Daniel Hienert, Maribel Acosta, Matthäus Zloch, Stefan Conrad, Stefan Dietze.

**Figure 2.** Average degree z. The x-axis is ordered by the number of edges m. The slope of trend lines is computed by robust regression using M-estimation. … registered DOI. The aforementioned website is automatically generated from the results. It contains all 280 datasets that were analyzed, grouped by topic domains (as in the LOD Cloud) together with links (a) to the original metadata obtained from datahub and (b)…
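The caption mentions trend lines fitted by robust regression with M-estimation. As a rough illustration of that family of methods, the sketch below fits a line with Huber weights via iteratively reweighted least squares. The page does not say which M-estimator or tuning constant the authors used, so the Huber loss, the constant `k=1.345`, the MAD-based scale, and the toy data are all assumptions.

```python
from statistics import median


def weighted_line_fit(xs, ys, weights):
    """Weighted least-squares fit of y = slope * x + intercept."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def huber_line_fit(xs, ys, k=1.345, iterations=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    weights = [1.0] * len(xs)
    slope, intercept = weighted_line_fit(xs, ys, weights)
    for _ in range(iterations):
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        center = median(residuals)
        # Robust scale estimate: median absolute deviation, rescaled.
        scale = median([abs(r - center) for r in residuals]) / 0.6745
        if scale == 0:
            break
        cutoff = k * scale
        # Residuals beyond the cutoff get proportionally smaller weights.
        weights = [1.0 if abs(r) <= cutoff else cutoff / abs(r)
                   for r in residuals]
        slope, intercept = weighted_line_fit(xs, ys, weights)
    return slope, intercept


# Toy data on the line y = 2x + 1 with a single outlier at the end.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 100
ols_slope, _ = weighted_line_fit(xs, ys, [1.0] * len(xs))
robust_slope, _ = huber_line_fit(xs, ys)
```

On this toy input the ordinary least-squares slope is pulled far above 2 by the outlier, while the reweighted fit stays close to the slope of the nine clean points.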

**Figure 3.** h-index. The x-axis (log scale) is ordered by the number of edges m. Each plot has the same range for the x-axis. R² measures how well the regression fits. The closer to 1 the better the prediction. … domain, with 63.50 edges per vertex on average (bio2rdf-irefindex). Over all observed domains and datasets, the value is 7.9 on average (with a standard deviation of 1.71). Datasets in Cross Domain have the lo…
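For readers unfamiliar with the h-index as a graph measure, a small sketch follows. It uses the usual definition transferred from Hirsch's citation index: the largest h such that at least h vertices have degree at least h. Which degree the paper uses (in, out, or total) is not stated on this page, so the function simply takes a degree sequence; `h_index` and the sample degrees are illustrative.

```python
def h_index(degrees):
    """Largest h such that at least h vertices have degree >= h."""
    ranked = sorted(degrees, reverse=True)
    h = 0
    for rank, degree in enumerate(ranked, start=1):
        if degree >= rank:
            h = rank
        else:
            break
    return h


# Made-up degree sequence: three vertices have degree >= 3, so h = 3.
degrees = [5, 3, 3, 2, 1, 1]
h = h_index(degrees)
```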

**Figure 4.** Exemplary plots created by the framework for datasets of different sizes.

**Figure 5.** Measure correlation. One may come to the question, which measures are essential for graph characterization. We noticed that many measures rely on the degree of a vertex. A Pearson correlation test on the results of the analysis of datasets from Section 4 shows that n, m, mu, and mp correlate strongly to both h-index measures and to the standard descriptive statistical measure. The degree of centralizat…
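The Pearson correlation test mentioned in the Figure 5 text can be sketched as a pairwise comparison of measure columns, one value per dataset. The numbers below are made up for illustration and are not taken from the paper's results; the 0.9 threshold for "strong" is likewise an assumption.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-dataset values for three measures (not real results).
results = {
    "n": [120, 950, 4300, 18000, 76000],
    "m": [310, 2600, 11900, 52000, 231000],
    "fill": [0.0215, 0.0029, 0.00064, 0.00016, 0.00004],
}

# Correlate every pair of measures and keep the strongly correlated ones.
correlations = {
    (a, b): pearson(results[a], results[b])
    for a, b in combinations(results, 2)
}
strong = [pair for pair, r in correlations.items() if abs(r) >= 0.9]
```

Measures that correlate strongly with one another carry overlapping information, which is the intuition behind reducing 28 measures to a smaller characterizing subset.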

Original abstract

As the availability and the inter-connectivity of RDF datasets grow, so does the necessity to understand the structure of the data. Understanding the topology of RDF graphs can guide and inform the development of, e.g. synthetic dataset generators, sampling methods, index structures, or query optimizers. In this work, we propose two resources: (i) a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and (ii) results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework. We present a preliminary analysis based on the proposed resources and point out implications for synthetic dataset generators. Finally, we identify a set of measures, that can be used to characterize graphs in the Semantic Web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper ships a usable framework plus a table of 28 measures on 280 LOD datasets; that resource is the real output, and the characterization claim is secondary and sample-dependent.

The main thing here is the release of code that ingests RDF, turns it into graphs, and computes standard topology measures, together with the resulting numbers for 280 public datasets. That combination is new and fills a practical gap for people who need RDF-specific baselines rather than generic graph stats pulled from other domains. The preliminary analysis and the short list of characterizing measures are useful side products but rest on the same data release.

The work is straightforward engineering plus measurement; nothing in the math or the method is claimed to be novel beyond the RDF packaging. The framework itself looks like the part that could see reuse, especially if the code is clean and documented enough for others to extend. The 280-dataset table is the other concrete deliverable that was not previously available in one place.

The soft spot is the representativeness question. The claim that a subset of measures can characterize Semantic Web graphs is drawn from distributions and correlations inside this particular collection. If the LOD Cloud sample over-represents certain sizes, domains, or connectivity patterns, the selected measures may not generalize. The abstract gives no numbers on validation, missing-value handling, or runtime behavior on the largest graphs, so anyone planning to rely on the numbers will want to inspect the implementation.

This paper is aimed at Semantic Web researchers who build generators, indexes, or optimizers and need empirical grounding. A reader who wants ready numbers or a starting point for their own pipeline will get immediate value. It is not a theoretical advance, but the resource contribution is solid enough to justify referee time. I would send it out for review rather than desk-reject; the community can judge whether the framework is maintainable and whether the sample bias is acceptable for their purposes.

Referee Report

2 major / 2 minor

Summary. The paper presents a software framework for acquiring, preparing, and computing 28 graph measures on large RDF graphs, releases the computed values for 280 LOD Cloud datasets, provides a preliminary analysis of the resulting distributions and correlations, discusses implications for synthetic dataset generators, and identifies a subset of measures sufficient to characterize Semantic Web graphs.

Significance. If the framework is robust and the empirical results reproducible, the release of both the analysis tool and the dataset of 28 measures across 280 real RDF graphs constitutes a useful community resource. The work supplies concrete measurements against public external datasets rather than fitted models, which can directly inform generator design and index development.

major comments (2)

[Dataset selection and preliminary analysis] The identification of a characterizing subset of measures rests on distributions observed across the 280 selected LOD Cloud datasets. The manuscript should explicitly address how these datasets were sampled to ensure coverage of topological diversity (size, density, domain, labeling patterns); without such justification the selected subset risks being specific to the LOD Cloud collection rather than generally applicable to RDF graphs.
[Framework and computation sections] The framework description provides no details on validation of the 28 computed measures (e.g., cross-checks against known small graphs or reference implementations), error handling for very large inputs, or the exact procedure used to obtain the reported values. These omissions affect defensibility of the released dataset.

minor comments (2)

[Preliminary analysis] Clarify in the text which exact subset of the 28 measures is proposed as sufficient for characterization and how that subset was derived from the correlation analysis.
[Framework description] Add a brief statement on the computational resources required and any scalability limits observed when running the framework on the largest graphs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.

Referee: [Dataset selection and preliminary analysis] The identification of a characterizing subset of measures rests on distributions observed across the 280 selected LOD Cloud datasets. The manuscript should explicitly address how these datasets were sampled to ensure coverage of topological diversity (size, density, domain, labeling patterns); without such justification the selected subset risks being specific to the LOD Cloud collection rather than generally applicable to RDF graphs.

Authors: The 280 datasets comprise all LOD Cloud entries for which public data dumps could be successfully retrieved and processed by the framework at the time of the study. We agree that an explicit justification of selection criteria is needed to support generalizability. We will add a subsection detailing the selection process, including ranges of dataset sizes, domain coverage, density statistics, and other topological characteristics observed in the collection.

Revision: yes

Referee: [Framework and computation sections] The framework description provides no details on validation of the 28 computed measures (e.g., cross-checks against known small graphs or reference implementations), error handling for very large inputs, or the exact procedure used to obtain the reported values. These omissions affect defensibility of the released dataset.

Authors: We will expand the relevant sections to include validation details (testing on small graphs with known properties and cross-checks against reference libraries), error-handling mechanisms for large inputs (e.g., memory safeguards and graceful skipping), and the precise acquisition-preparation-computation pipeline used to produce the reported values.

Revision: yes
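The kind of validation on small graphs with known properties discussed in this exchange could look like the following sketch. It checks Freeman's degree centralization for undirected graphs against two textbook cases: a star graph (value 1) and a cycle (value 0). This is an illustration of the testing idea only; the function names are hypothetical and nothing here reflects the authors' actual test suite.

```python
def degrees_from_edges(n, edges):
    """Degree sequence of an undirected graph on vertices 0..n-1."""
    degrees = [0] * n
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def degree_centralization(degrees):
    """Freeman degree centralization for an undirected simple graph."""
    n = len(degrees)
    if n < 3:
        raise ValueError("centralization needs at least 3 vertices")
    max_degree = max(degrees)
    # Normalized by the value reached by a star graph on n vertices.
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))


# Star graph: vertex 0 is connected to the five other vertices.
star_edges = [(0, i) for i in range(1, 6)]
# Cycle graph: every vertex has exactly two neighbours.
cycle_edges = [(i, (i + 1) % 6) for i in range(6)]

star_value = degree_centralization(degrees_from_edges(6, star_edges))
cycle_value = degree_centralization(degrees_from_edges(6, cycle_edges))
```

Cases like these have closed-form answers, so any disagreement points to a bug in the measure implementation rather than to noise in the data.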

Circularity Check

0 steps flagged

No circularity: empirical computation on external public datasets

The paper supplies a software framework and reports values for 28 graph measures computed on 280 LOD Cloud datasets. The claim that a subset of measures characterizes Semantic Web graphs rests on observed distributions and correlations across those external datasets. No derivation, fitted parameter, prediction, or uniqueness theorem is presented that reduces to the authors' own inputs or prior self-citations. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical resource paper. No mathematical derivations, fitted parameters, or new postulated entities are introduced; the contribution rests on the existence and correct execution of the described framework and the public availability of the LOD Cloud datasets.
