arxiv: 1906.12089 · v1 · pith:SAA32E5Wnew · submitted 2019-06-28 · 💻 cs.IR · cs.AI · cs.DB

Uncovering the Semantics of Wikipedia Categories

Pith reviewed 2026-05-25 13:34 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.DB
keywords: wikipedia categories, axiom discovery, knowledge graph enrichment, dbpedia, category semantics, relation extraction, type assertions, semantic web

The pith

Wikipedia categories encode extractable type and relation axioms that enrich DBpedia with millions of facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to discover axioms about Wikipedia categories by combining signals from the category graph structure, the Wikipedia pages assigned to each category, and the words appearing in category names. DBpedia serves as background knowledge to interpret those signals and validate the resulting axioms. The method produces 703k axioms for 502k categories and uses them to add 4.4M relation assertions plus 3.3M type assertions to DBpedia. A reader would care because Wikipedia categories already exist at scale yet have been only partially exploited for structured knowledge.

Core claim

We introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia's categories and populate the DBpedia knowledge graph with additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.

What carries the argument

The axiom discovery approach that integrates category network structure, instance membership, and lexical cues with DBpedia as background knowledge to infer category semantics.

Load-bearing premise

Signals from the category network, its instances, and lexical forms can be combined reliably with DBpedia to identify correct category axioms.

What would settle it

A manual evaluation on a large random sample of the 703k discovered axioms and the 7.7M added assertions finds precision below 80 percent.
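
The test described above can be made concrete with an interval estimate rather than a bare point estimate. The following is a minimal sketch, assuming a simple random sample in which each sampled item was manually judged correct or incorrect. The function names, the 95% level, and the use of the Wilson score interval are illustrative assumptions, not the paper's evaluation protocol.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= correct <= n:
        raise ValueError("correct must lie between 0 and n")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def precision_below(correct: int, n: int, threshold: float = 0.80) -> bool:
    """True only when even the upper bound of the interval is under the threshold."""
    _, upper = wilson_interval(correct, n)
    return upper < threshold
```

Requiring the upper bound to fall below the threshold is a deliberately conservative reading of "finds precision below 80 percent": a sample whose point estimate is slightly under 80% but whose interval still covers it would not count as settling the question.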

Figures

Figures reproduced from arXiv: 1906.12089 by Heiko Paulheim, Nicolas Heist.

Figure 1: Excerpt of the Wikipedia category graph showing the category …
Figure 2: Overview of the Cat2Ax approach. … knowledge and lexicalisations for the decision of whether a pattern is applicable to the category. Finally, we generate assertions by applying the axioms of a category to its resources and subsequently use post-filtering to remove assertions that would create contradictions in the knowledge graph. 4.1 Candidate Selection: In this first step, we want to extract sets of catego…
Figure 3: We can observe that the precision considerably drops for a threshold …
Figure 3: Performance of the pattern application for varying confidence intervals.
Figure 4: Comparison of the extracted results. … graph linked to Wikipedia (or DBpedia) can be extended with the approach discussed in this paper. This holds, e.g., for YAGO and Wikidata. Moreover, the approach could also be applied to knowledge graphs created from other Wikis, such as DBkWik [11], or used with different hierarchies, such as the Wikipedia Bitaxonomy [4] or WebIsALOD [10]. Hence, Cat2Ax has general pot…
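
The Figure 2 excerpt mentions a post-filtering step that removes assertions which would create contradictions in the knowledge graph. The excerpt does not say how contradictions are detected, so the sketch below assumes one plausible reading: a candidate type assertion is dropped when it is declared disjoint with a type the resource already has. The function name, the data layout, and the disjointness pairs are hypothetical.

```python
def post_filter(candidates, existing_types, disjoint):
    """Drop candidate type assertions that clash with an existing type.

    candidates     -- iterable of (resource, type) pairs proposed by the axioms
    existing_types -- dict mapping a resource to the set of types it already has
    disjoint       -- set of frozenset({type_a, type_b}) pairs assumed disjoint
    """
    kept = []
    for resource, new_type in candidates:
        clash = any(
            frozenset((new_type, known)) in disjoint
            for known in existing_types.get(resource, ())
        )
        if not clash:
            kept.append((resource, new_type))
    return kept
```

Storing each disjoint pair as a `frozenset` makes the check symmetric, so the order in which the two types are declared does not matter.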
Original abstract

The Wikipedia category graph serves as the taxonomic backbone for large-scale knowledge graphs like YAGO or Probase, and has been used extensively for tasks like entity disambiguation or semantic similarity estimation. Wikipedia's categories are a rich source of taxonomic as well as non-taxonomic information. The category 'German science fiction writers', for example, encodes the type of its resources (Writer), as well as their nationality (German) and genre (Science Fiction). Several approaches in the literature make use of fractions of this encoded information without exploiting its full potential. In this paper, we introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia's categories and populate the DBpedia knowledge graph with additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.
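
To illustrate what the abstract's example amounts to once axioms are known, here is a minimal sketch of applying category axioms to category members. It covers only the final assertion-generation step, not the discovery of the axioms. The class names, the `dbo:`/`dbr:` identifiers, and the placeholder resources are assumptions chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeAxiom:
    category: str
    dbo_type: str  # e.g. "dbo:Writer" (illustrative identifier)


@dataclass(frozen=True)
class RelationAxiom:
    category: str
    predicate: str  # e.g. "dbo:nationality" (illustrative identifier)
    value: str


def apply_axioms(members, type_axioms, relation_axioms):
    """Return the set of (subject, predicate, object) triples implied by the axioms.

    members maps a category name to the resources assigned to that category.
    """
    triples = set()
    for axiom in type_axioms:
        for resource in members.get(axiom.category, ()):
            triples.add((resource, "rdf:type", axiom.dbo_type))
    for axiom in relation_axioms:
        for resource in members.get(axiom.category, ()):
            triples.add((resource, axiom.predicate, axiom.value))
    return triples


# Placeholder data mirroring the abstract's example category.
CATEGORY = "German science fiction writers"
members = {CATEGORY: ["Writer_A", "Writer_B"]}
type_axioms = [TypeAxiom(CATEGORY, "dbo:Writer")]
relation_axioms = [
    RelationAxiom(CATEGORY, "dbo:nationality", "dbr:Germany"),
    RelationAxiom(CATEGORY, "dbo:genre", "dbr:Science_fiction"),
]
assertions = apply_axioms(members, type_axioms, relation_axioms)
```

Each axiom fans out to every member of its category, which is how a few hundred thousand axioms can plausibly yield millions of assertions.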

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an approach for the discovery of category axioms that combines information from the Wikipedia category network, category instances, and their lexicalisations. Using DBpedia as background knowledge, it reports discovering 703k axioms covering 502k categories and adding 4.4M relation assertions and 3.3M type assertions to the DBpedia knowledge graph at more than 87% and 90% precision, respectively.

Significance. If the reported scale and precision figures hold under a clearly described and non-circular methodology, the work would be significant for enriching large-scale knowledge graphs such as DBpedia, YAGO, and Probase with semantic information from Wikipedia categories, benefiting tasks like entity disambiguation and semantic similarity estimation.

major comments (2)
  1. [Abstract] The claims of 703k axioms, 4.4M relation assertions, and 3.3M type assertions at the stated precisions cannot be verified from the abstract, which supplies no description of the discovery algorithm, evaluation methodology, or error analysis.
  2. The method uses DBpedia both as background knowledge and as the target for new assertions; this creates a potential circular dependence whose impact on the reported precision figures must be explicitly ruled out or quantified.
minor comments (1)
  1. Clarify how the combination of network structure, instance data, and lexical cues avoids over-reliance on DBpedia labels that may already encode the target relations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below, indicating the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] The claims of 703k axioms, 4.4M relation assertions, and 3.3M type assertions at the stated precisions cannot be verified from the abstract, which supplies no description of the discovery algorithm, evaluation methodology, or error analysis.

    Authors: We agree that the submitted abstract emphasizes results without describing the method or evaluation. The full paper details the approach (combining category network, instances, and lexicalisations with DBpedia background) and evaluation (manual sampling for precision) in later sections. We will revise the abstract to include a concise summary of the discovery algorithm, evaluation methodology, and error analysis to improve verifiability. revision: yes

  2. Referee: The method uses DBpedia both as background knowledge and as the target for new assertions; this creates a potential circular dependence whose impact on the reported precision figures must be explicitly ruled out or quantified.

    Authors: This concern is valid and merits explicit treatment. DBpedia provides background facts to support axiom extraction from Wikipedia categories, while new assertions are additions to DBpedia. Precision was assessed through independent manual evaluation of samples. In the revision we will add analysis quantifying any overlap between background knowledge and new assertions, and demonstrate that the reported precisions (>87% relations, >90% types) are not affected by circularity. revision: yes
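
The overlap analysis promised in the second response could take roughly the following shape. This is a minimal sketch, assuming assertions are represented as hashable triples and that a manually labelled sample is available. The function names and data layout are hypothetical and do not describe the authors' actual analysis.

```python
def overlap_report(new_assertions, background):
    """Count how many generated assertions were already present in the background KG."""
    new = set(new_assertions)
    already = new & set(background)
    return {
        "new_total": len(new),
        "already_in_background": len(already),
        "overlap_ratio": len(already) / len(new) if new else 0.0,
    }


def precision_on_novel(labelled_sample, background):
    """Precision restricted to sampled assertions that are absent from the background KG.

    labelled_sample is an iterable of (triple, is_correct) pairs from manual review.
    Returns None when the sample contains no novel assertions.
    """
    known = set(background)
    novel = [is_correct for triple, is_correct in labelled_sample if triple not in known]
    if not novel:
        return None
    return sum(1 for ok in novel if ok) / len(novel)
```

Reporting precision on the novel subset separately addresses the circularity concern directly: if precision there matches the overall figure, the background knowledge is not inflating the result.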

Circularity Check

0 steps flagged

No significant circularity; derivation uses external background knowledge

full rationale

The paper describes an inductive process that combines Wikipedia category network structure, instance data, and lexical cues with DBpedia as independent background knowledge to induce axioms, then emits new assertions back into DBpedia. No step is shown to reduce by construction to a fitted parameter, self-definition, or self-citation chain; the background KG and target corpus are treated as distinct inputs, and the reported scale and precision figures are presented as empirical outcomes rather than tautological outputs. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; the central claim rests on DBpedia serving as reliable background knowledge and on the assumption that category signals are sufficiently informative. No free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5693 in / 1051 out tokens · 35123 ms · 2026-05-25T13:34:38.802074+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

  1. [1] Aprosio, A.P., Giuliano, C., Lavelli, A.: Extending the coverage of DBpedia properties using distant supervision over Wikipedia. In: NLP-DBpedia@ISWC (2013)

  2. [2] Bryl, V., Bizer, C., Paulheim, H.: Gathering alternative surface forms for dbpedia entities. In: Workshop on NLP&DBpedia. pp. 13–24 (2015)

  3. [3] Färber, M., Bartscherer, F., Menne, C., Rettinger, A.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web pp. 1–53 (2016)

  4. [4] Flati, T., et al.: Two is bigger (and better) than one: the Wikipedia bitaxonomy project. In: 52nd Annual Meeting of the ACL. vol. 1, pp. 945–955 (2014)

  5. [5] Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological bulletin 76(5), 378 (1971)

  6. [6] Fossati, M., Kontokostas, D., Lehmann, J.: Unsupervised learning of an extensive and usable taxonomy for DBpedia. In: 11th International Conference on Semantic Systems. pp. 177–184. ACM (2015)

  7. [7] Gerber, D., Ngomo, A.C.N.: Bootstrapping the linked data web. In: 1st Workshop on Web Scale Knowledge Extraction@ISWC. vol. 2011 (2011)

  8. [8] Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: 14th Conference on Computational Linguistics. vol. 2, pp. 539–545 (1992)

  9. [9] Heist, N., Hertling, S., Paulheim, H.: Language-agnostic relation extraction from abstracts in wikis. Information 9(4), 75 (2018)

  10. [10] Hertling, S., Paulheim, H.: WebIsALOD: providing hypernymy relations extracted from the web as linked open data. In: International Semantic Web Conference. pp. 111–119. Springer (2017)

  11. [11] Hertling, S., Paulheim, H.: DBkWik: A consolidated knowledge graph from thousands of wikis. In: IEEE International Conference on Big Knowledge, ICBK (2018)

  12. [12] Kozareva, Z., Hovy, E.: Learning arguments and supertypes of semantic relations using recursive patterns. In: 48th annual meeting of the ACL. pp. 1482–1491. ACL (2010)

  13. [13] Kuhn, P., Mischkewitz, S., et al.: Type inference on Wikipedia list pages. Informatik (2016)

  14. [14] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. biometrics pp. 159–174 (1977)

  15. [15] Lehmann, J.: DL-Learner: learning concepts in description logics. Journal of Machine Learning Research 10(Nov), 2639–2642 (2009)

  16. [16] Lehmann, J., Isele, R., Jakob, M., et al.: DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015)

  17. [17] Liu, Q., Xu, K., et al.: Catriple: Extracting triples from Wikipedia categories. In: Asian Semantic Web Conference. pp. 330–344. Springer (2008)

  18. [18] Mahdisoltani, F., Biega, J., Suchanek, F.M.: YAGO3: A knowledge base from multilingual Wikipedias. In: CIDR (2013)

  19. [19] Mintz, M., Bills, S., et al.: Distant supervision for relation extraction without labeled data. In: ACL-AFNLP. vol. 2, pp. 1003–1011 (2009)

  20. [20] Muñoz, E., Hogan, A., Mileo, A.: Triplifying Wikipedia’s tables. LD4IE@ISWC 1057 (2013)

  21. [21] Nastase, V., Strube, M.: Decoding Wikipedia categories for knowledge acquisition. In: AAAI. vol. 8, pp. 1219–1224 (2008)

  22. [22] Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3), 489–508 (2017)

  23. [23] Paulheim, H., Bizer, C.: Type inference on noisy RDF data. In: International Semantic Web Conference. pp. 510–525. Springer (2013)

  24. [24] Paulheim, H., Ponzetto, S.P.: Extending DBpedia with Wikipedia list pages. NLP-DBpedia ISWC 13 (2013)

  25. [25] Ponzetto, S.P., Strube, M.: Deriving a large scale taxonomy from Wikipedia. In: AAAI. vol. 7, pp. 1440–1445 (2007)

  26. [26] Rettinger, A., Lösch, U., Tresp, V., d'Amato, C., Fanizzi, N.: Mining the semantic web. Data Mining and Knowledge Discovery 24(3), 613–662 (2012)

  27. [27] Ringler, D., Paulheim, H.: One knowledge graph to rule them all? analyzing the differences between DBpedia, YAGO, Wikidata & co. In: Joint German/Austrian Conference on Artificial Intelligence. pp. 366–372. Springer (2017)

  28. [28] Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: 5th International Conference on Web Intelligence, Mining and Semantics. p. 10. ACM, New York (2015)

  29. [29] Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: 16th International Conference on World Wide Web. pp. 697–706. ACM (2007)

  30. [30] Velardi, P., Faralli, S., Navigli, R.: OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics 39(3), 665–707 (2013)

  31. [31] Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)

  32. [32] Xu, B., Xie, C., et al.: Learning defining features for categories. In: IJCAI. pp. 3924–3930 (2016)

  33. [33] Zaveri, A., Rula, A., Maurino, A., et al.: Quality assessment for linked data: A survey. Semantic Web 7(1), 63–93 (2016)