Uncovering the Semantics of Wikipedia Categories
Pith reviewed 2026-05-25 13:34 UTC · model grok-4.3
The pith
Wikipedia categories encode extractable type and relation axioms that enrich DBpedia with millions of facts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia's categories and populate the DBpedia knowledge graph with additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.
What carries the argument
The axiom discovery approach that integrates category network structure, instance membership, and lexical cues with DBpedia as background knowledge to infer category semantics.
Load-bearing premise
Signals from the category network, its instances, and lexical forms can be combined reliably with DBpedia to identify correct category axioms.
What would settle it
A manual evaluation on a large random sample of the 703k discovered axioms and the 7.7M added assertions finds precision below 80 percent.
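One way to make this test concrete, offered as a sketch and not as the paper's evaluation protocol: annotators judge a random sample as correct or incorrect, and a confidence interval for the sample precision is compared with the 80 percent threshold. The function names `wilson_interval` and `claim_refuted`, and the choice of a 95% Wilson score interval, are assumptions made for illustration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def claim_refuted(correct: int, total: int, threshold: float = 0.80) -> bool:
    """True only if the entire interval lies below the threshold."""
    _, upper = wilson_interval(correct, total)
    return upper < threshold
```

Requiring the upper bound to fall below the threshold is a deliberately conservative design choice: a sample whose point estimate is slightly under 80 percent but whose interval still straddles it would not count as settling the question.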
Original abstract
The Wikipedia category graph serves as the taxonomic backbone for large-scale knowledge graphs like YAGO or Probase, and has been used extensively for tasks like entity disambiguation or semantic similarity estimation. Wikipedia's categories are a rich source of taxonomic as well as non-taxonomic information. The category 'German science fiction writers', for example, encodes the type of its resources (Writer), as well as their nationality (German) and genre (Science Fiction). Several approaches in the literature make use of fractions of this encoded information without exploiting its full potential. In this paper, we introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia's categories and populate the DBpedia knowledge graph with additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.
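To make the abstract's example concrete, the following is a minimal sketch of how a category axiom could be represented as data and applied to category members to yield assertions. It is an illustration under stated assumptions, not the paper's implementation: the `CategoryAxioms` class, the `new_assertions` helper, the `ex:` member identifiers, and the specific class and property names are hypothetical placeholders.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class CategoryAxioms:
    """Type and relation axioms attached to one category (hypothetical shape)."""
    category: str
    types: frozenset[str]
    relations: frozenset[tuple[str, str]]  # (property, value) pairs


def new_assertions(axioms: CategoryAxioms, members: list[str],
                   known: set[Triple]) -> set[Triple]:
    """Apply every axiom to every member; keep only triples not already known."""
    generated: set[Triple] = set()
    for member in members:
        for type_name in axioms.types:
            generated.add((member, "rdf:type", type_name))
        for prop, value in axioms.relations:
            generated.add((member, prop, value))
    return generated - known


# Illustrative axioms for the category named in the abstract.
german_sf_writers = CategoryAxioms(
    category="German science fiction writers",
    types=frozenset({"Writer"}),
    relations=frozenset({("nationality", "German"), ("genre", "Science Fiction")}),
)
```

For example, applying `german_sf_writers` to two placeholder members, one of which is already typed as `Writer` in the background graph, produces five new triples instead of six, because the known type assertion is filtered out.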
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an approach for the discovery of category axioms that combines information from the Wikipedia category network, category instances, and their lexicalisations. Using DBpedia as background knowledge, it reports discovering 703k axioms covering 502k categories and adding 4.4M relation assertions and 3.3M type assertions to the DBpedia knowledge graph at more than 87% and 90% precision, respectively.
Significance. If the reported scale and precision figures hold under a clearly described and non-circular methodology, the work would be significant for enriching large-scale knowledge graphs such as DBpedia, YAGO, and Probase with semantic information from Wikipedia categories, benefiting tasks like entity disambiguation and semantic similarity estimation.
major comments (2)
- [Abstract] The claim of 703k axioms, 4.4M relation assertions, and 3.3M type assertions at the stated precisions cannot be verified because the abstract supplies no description of the discovery algorithm, evaluation methodology, or error analysis.
- The method uses DBpedia both as background knowledge and as the target for new assertions; this creates a potential circular dependence whose impact on the reported precision figures must be explicitly ruled out or quantified.
minor comments (1)
- Clarify how the combination of network structure, instance data, and lexical cues avoids over-reliance on DBpedia labels that may already encode the target relations.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comments point by point below, indicating the revisions we will incorporate.
- Referee: [Abstract] The claim of 703k axioms, 4.4M relation assertions, and 3.3M type assertions at the stated precisions cannot be verified because the abstract supplies no description of the discovery algorithm, evaluation methodology, or error analysis.
  Authors: We agree that the submitted abstract emphasizes results without describing the method or evaluation. The full paper details the approach (combining category network, instances, and lexicalisations with DBpedia background) and evaluation (manual sampling for precision) in later sections. We will revise the abstract to include a concise summary of the discovery algorithm, evaluation methodology, and error analysis to improve verifiability.
  Revision: yes
- Referee: The method uses DBpedia both as background knowledge and as the target for new assertions; this creates a potential circular dependence whose impact on the reported precision figures must be explicitly ruled out or quantified.
  Authors: This concern is valid and merits explicit treatment. DBpedia provides background facts to support axiom extraction from Wikipedia categories, while new assertions are additions to DBpedia. Precision was assessed through independent manual evaluation of samples. In the revision we will add analysis quantifying any overlap between background knowledge and new assertions, and demonstrate that the reported precisions (>87% relations, >90% types) are not affected by circularity.
  Revision: yes
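The overlap analysis mentioned in this exchange could take several forms. One possibility, sketched here as an assumption and not as the authors' planned analysis, is to split the manually judged sample by whether each triple already appears in the background graph and to report precision for each group separately. The function name `stratified_precision` and the bucket labels are hypothetical.

```python
from typing import Optional

Triple = tuple[str, str, str]


def stratified_precision(
    judged: list[tuple[Triple, bool]],
    background: set[Triple],
) -> dict[str, Optional[float]]:
    """Precision of judged triples, split by presence in the background graph.

    `judged` pairs each sampled triple with the annotators' verdict.
    A bucket with no samples maps to None instead of a misleading 0.0.
    """
    counts = {"in_background": [0, 0], "novel": [0, 0]}  # [correct, total]
    for triple, is_correct in judged:
        bucket = "in_background" if triple in background else "novel"
        counts[bucket][0] += int(is_correct)
        counts[bucket][1] += 1
    return {
        bucket: (correct / total if total else None)
        for bucket, (correct, total) in counts.items()
    }
```

If the precision of the `novel` bucket alone stays close to the headline figures, the circularity concern has little practical effect on them; a large gap between the two buckets would indicate that triples already in the background graph are inflating the reported precision.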
Circularity Check
No significant circularity; derivation uses external background knowledge
The paper describes an inductive process that combines Wikipedia category network structure, instance data, and lexical cues with DBpedia as independent background knowledge to induce axioms, then emits new assertions back into DBpedia. No step is shown to reduce by construction to a fitted parameter, self-definition, or self-citation chain; the background KG and target corpus are treated as distinct inputs, and the reported scale and precision figures are presented as empirical outcomes rather than tautological outputs. The derivation is therefore self-contained against external benchmarks.