arxiv: 1907.01885 · v1 · pith:7B43UVKVnew · submitted 2019-07-03 · 💻 cs.DB

A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs

Pith reviewed 2026-05-25 09:38 UTC · model grok-4.3

classification 💻 cs.DB
keywords: RDF graphs, graph measures, LOD Cloud, Semantic Web, graph topology, software framework, dataset analysis

The pith

A framework and analysis of 280 RDF datasets identify measures that characterize Semantic Web graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a software framework for acquiring and analyzing the topology of large RDF graphs using various graph measures. It applies the framework to 280 datasets from the Linked Open Data Cloud, calculating 28 different measures for each. A preliminary analysis of these results points to a smaller subset of measures that can effectively characterize the structure of graphs in the Semantic Web. Such characterization would support the creation of better synthetic dataset generators, sampling methods, and query optimizers for RDF data.

Core claim

We propose a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and we provide results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework. We present a preliminary analysis based on the proposed resources and point out implications for synthetic dataset generators. Finally, we identify a set of measures that can be used to characterize graphs in the Semantic Web.

What carries the argument

The software framework that acquires, prepares, and computes 28 graph measures on RDF graphs, applied across 280 LOD Cloud datasets to support identification of a characterizing subset.
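
To make the idea of computing measures on an RDF graph concrete, the following is a minimal sketch and not the authors' framework. It assumes an RDF graph is treated as a directed multigraph whose vertices are subjects and objects and whose edges are triples, and it assumes "average degree" means edges per vertex (m / n). The function name `degree_measures` and the toy triples are hypothetical.

```python
from collections import defaultdict


def degree_measures(triples):
    """Compute a few degree-based measures from (subject, predicate, object) triples.

    Assumption: each triple is one directed edge from subject to object;
    predicates are ignored for the topology.
    """
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    vertices = set()
    m = 0  # number of edges (triples)
    for s, _p, o in triples:
        out_deg[s] += 1
        in_deg[o] += 1
        vertices.update((s, o))
        m += 1
    n = len(vertices)  # number of vertices
    return {
        "n": n,
        "m": m,
        "avg_degree": m / n if n else 0.0,  # assumed definition: edges per vertex
        "max_in_degree": max(in_deg.values(), default=0),
        "max_out_degree": max(out_deg.values(), default=0),
    }


# Toy input, invented for illustration only.
toy = [("a", "p", "b"), ("a", "q", "c"), ("b", "p", "c")]
result = degree_measures(toy)
```

A real pipeline would first parse RDF dumps into such triples; that step is left out here because the source does not describe the parser used.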

Load-bearing premise

The 280 selected LOD Cloud datasets are sufficiently representative of RDF graphs that the observed measure distributions and the identified characterizing subset will generalize to other RDF collections.

What would settle it

Applying the same framework and analysis process to a different collection of RDF graphs and finding that a different subset of measures is required to characterize them.

Figures

Figures reproduced from arXiv: 1907.01885 by Daniel Hienert, Maribel Acosta, Matthäus Zloch, Stefan Conrad, Stefan Dietze.

Figure 1: Illustration of the semi-automatic process pipeline. Steps 1-4 include …
Figure 2: Average degree z. The x-axis is ordered by the number of edges m. The slope of trend lines is computed by robust regression using M-estimation.
Figure 3: h-index. The x-axis (log scale) is ordered by the number of edges m. Each plot has the same range for the x-axis. R² measures how well the regression fits. The closer to 1 the better the prediction.
Figure 4: Exemplary plots created by the framework for datasets of different sizes.
Figure 5: Measure correlation
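
The h-index plotted in Figure 3 can be illustrated with a short sketch. It assumes the usual network reading of the h-index: the largest h such that at least h vertices have degree of at least h. Whether the framework computes it on in-degree, out-degree, or total degree is not stated here, so the input is simply a list of degrees, and the function name `h_index` is hypothetical.

```python
def h_index(degrees):
    """Return the largest h such that at least h entries of `degrees` are >= h."""
    ranked = sorted(degrees, reverse=True)
    h = 0
    for rank, degree in enumerate(ranked, start=1):
        if degree >= rank:
            h = rank  # the top `rank` vertices all have degree >= rank
        else:
            break  # degrees only get smaller from here on
    return h


# Toy degree sequence, invented for illustration only.
example_h = h_index([5, 3, 3, 1, 0])
```

Sorting makes the sketch O(n log n); for very large graphs a counting approach over the degree histogram would avoid the sort.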
Original abstract

As the availability and the inter-connectivity of RDF datasets grow, so does the necessity to understand the structure of the data. Understanding the topology of RDF graphs can guide and inform the development of, e.g., synthetic dataset generators, sampling methods, index structures, or query optimizers. In this work, we propose two resources: (i) a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and (ii) results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework. We present a preliminary analysis based on the proposed resources and point out implications for synthetic dataset generators. Finally, we identify a set of measures that can be used to characterize graphs in the Semantic Web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a software framework for acquiring, preparing, and computing 28 graph measures on large RDF graphs, releases the computed values for 280 LOD Cloud datasets, provides a preliminary analysis of the resulting distributions and correlations, discusses implications for synthetic dataset generators, and identifies a subset of measures sufficient to characterize Semantic Web graphs.

Significance. If the framework is robust and the empirical results reproducible, the release of both the analysis tool and the dataset of 28 measures across 280 real RDF graphs constitutes a useful community resource. The work supplies concrete measurements against public external datasets rather than fitted models, which can directly inform generator design and index development.

major comments (2)
  1. [Dataset selection and preliminary analysis] The identification of a characterizing subset of measures rests on distributions observed across the 280 selected LOD Cloud datasets. The manuscript should explicitly address how these datasets were sampled to ensure coverage of topological diversity (size, density, domain, labeling patterns); without such justification the selected subset risks being specific to the LOD Cloud collection rather than generally applicable to RDF graphs.
  2. [Framework and computation sections] The framework description provides no details on validation of the 28 computed measures (e.g., cross-checks against known small graphs or reference implementations), error handling for very large inputs, or the exact procedure used to obtain the reported values. These omissions affect defensibility of the released dataset.
minor comments (2)
  1. [Preliminary analysis] Clarify in the text which exact subset of the 28 measures is proposed as sufficient for characterization and how that subset was derived from the correlation analysis.
  2. [Framework description] Add a brief statement on the computational resources required and any scalability limits observed when running the framework on the largest graphs.
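
The first minor comment asks how a subset could be derived from a correlation analysis. One possible procedure is sketched below; it is an illustration and not the paper's method. It assumes Pearson correlation across datasets, a greedy pass over the measures, and an arbitrary cutoff of 0.9. The function names, the measure name `reciprocity`, and the toy values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / (sd_x * sd_y)


def select_measures(table, threshold=0.9):
    """Greedily keep a measure only if it is not strongly correlated
    (|r| >= threshold) with any measure kept so far.

    `table` maps a measure name to its values, one value per dataset.
    The result depends on the iteration order of `table`.
    """
    kept = []
    for name, values in table.items():
        if all(abs(pearson(values, table[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values for four imaginary datasets, invented for illustration only.
toy_table = {
    "n": [1, 2, 3, 4],
    "m": [2, 4, 6, 8],  # perfectly correlated with n
    "reciprocity": [0.5, 0.1, 0.4, 0.2],
}
selected = select_measures(toy_table)
```

Because the pass is greedy, listing the measures in a different order can keep a different representative from each correlated group, which is one reason the derivation needs to be stated explicitly.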

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.

  1. Referee: [Dataset selection and preliminary analysis] The identification of a characterizing subset of measures rests on distributions observed across the 280 selected LOD Cloud datasets. The manuscript should explicitly address how these datasets were sampled to ensure coverage of topological diversity (size, density, domain, labeling patterns); without such justification the selected subset risks being specific to the LOD Cloud collection rather than generally applicable to RDF graphs.

    Authors: The 280 datasets comprise all LOD Cloud entries for which public data dumps could be successfully retrieved and processed by the framework at the time of the study. We agree that an explicit justification of selection criteria is needed to support generalizability. We will add a subsection detailing the selection process, including ranges of dataset sizes, domain coverage, density statistics, and other topological characteristics observed in the collection. revision: yes

  2. Referee: [Framework and computation sections] The framework description provides no details on validation of the 28 computed measures (e.g., cross-checks against known small graphs or reference implementations), error handling for very large inputs, or the exact procedure used to obtain the reported values. These omissions affect defensibility of the released dataset.

    Authors: We will expand the relevant sections to include validation details (testing on small graphs with known properties and cross-checks against reference libraries), error-handling mechanisms for large inputs (e.g., memory safeguards and graceful skipping), and the precise acquisition-preparation-computation pipeline used to produce the reported values. revision: yes
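
The kind of validation promised in the second response, testing on small graphs with known properties, might look like the following sketch. It assumes Freeman's degree centralization for a simple undirected graph, which is exactly 1 for a star and 0 for a ring; whether the framework uses this normalization is an assumption. The function name and the test graphs are hypothetical.

```python
def degree_centralization(edges):
    """Freeman degree centralization of a simple undirected graph.

    `edges` is a list of (u, v) pairs; isolated vertices cannot be
    represented in this input and are therefore ignored.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    n = len(degree)
    if n < 3:
        raise ValueError("degree centralization needs at least 3 vertices")
    d_max = max(degree.values())
    # Sum of differences to the maximum degree, normalized by the value
    # attained by a star graph on n vertices.
    return sum(d_max - d for d in degree.values()) / ((n - 1) * (n - 2))


# Known-answer checks on two small reference graphs.
star = [("hub", f"leaf{i}") for i in range(5)]  # one hub, five leaves
ring = [(i, (i + 1) % 6) for i in range(6)]     # six vertices in a cycle
star_value = degree_centralization(star)
ring_value = degree_centralization(ring)
```

Checks of this form are cheap to run on every release and catch normalization mistakes that are hard to spot on graphs with millions of edges.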

Circularity Check

0 steps flagged

No circularity: empirical computation on external public datasets


The paper supplies a software framework and reports values for 28 graph measures computed on 280 LOD Cloud datasets. The claim that a subset of measures characterizes Semantic Web graphs rests on observed distributions and correlations across those external datasets. No derivation, fitted parameter, prediction, or uniqueness theorem is presented that reduces to the authors' own inputs or prior self-citations. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical resource paper. No mathematical derivations, fitted parameters, or new postulated entities are introduced; the contribution rests on the existence and correct execution of the described framework and the public availability of the LOD Cloud datasets.

pith-pipeline@v0.9.0 · 5676 in / 1114 out tokens · 28630 ms · 2026-05-25T09:38:40.822470+00:00 · methodology

