Which Are the Low-Resource Languages of the Semantic Web?
Pith reviewed 2026-05-08 10:58 UTC · model grok-4.3
The pith
A methodology using DBpedia, BabelNet and Wikidata defines low-, medium- and high-resource languages for Linked Open Data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization brings a formal definition of low-, high-, and medium-resource languages that could be leveraged to select cross-lingual transfer candidates.
What carries the argument
The multi-level categorization of languages derived from their presence and distribution statistics in DBpedia, BabelNet, and Wikidata, which serves as a quantitative proxy for resource levels in LOD KGs.
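To make the idea of a multi-level categorization concrete, here is a minimal sketch of how per-KG counts could be turned into a single low/medium/high label. The poster does not state its thresholds or its aggregation rule, so everything here is an assumption made for illustration: the function names, the cut-offs (100 and 10,000), the median-across-KGs rule, the placeholder language codes, and the counts.

```python
# Hypothetical sketch: none of these names, thresholds, or counts come from the paper.
LEVELS = ("low", "medium", "high")


def level_in_kg(count, low_max, high_min):
    """Map one language's count in one KG to a level index (0=low, 1=medium, 2=high)."""
    if count >= high_min:
        return 2
    if count > low_max:
        return 1
    return 0


def categorize(counts, thresholds):
    """Combine per-KG levels into one label per language.

    counts:     {kg_name: {language_code: count}}
    thresholds: {kg_name: (low_max, high_min)}  # assumed cut-offs
    A language missing from a KG is treated as having a count of 0.
    """
    languages = set()
    for per_language in counts.values():
        languages.update(per_language)

    result = {}
    for lang in sorted(languages):
        levels = sorted(
            level_in_kg(counts[kg].get(lang, 0), *thresholds[kg]) for kg in counts
        )
        # Assumed aggregation rule: median level across KGs
        # (upper median if the number of KGs is even).
        result[lang] = LEVELS[levels[len(levels) // 2]]
    return result


# Placeholder language codes and made-up counts, for illustration only.
example_counts = {
    "dbpedia": {"xx": 5, "yy": 500, "zz": 50_000},
    "babelnet": {"xx": 10, "yy": 2_000, "zz": 90_000},
    "wikidata": {"yy": 20_000, "zz": 120_000},
}
example_thresholds = {kg: (100, 10_000) for kg in example_counts}
print(categorize(example_counts, example_thresholds))
```

A median rule is only one option. Taking the minimum level across KGs would be stricter, and the choice matters for languages that are well covered in one graph but nearly absent from the others.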
Load-bearing premise
That the language distribution statistics from DBpedia, BabelNet, and Wikidata serve as a valid measure of resource availability for cross-lingual transfer in the Semantic Web overall.
What would settle it
Finding a language that has very low representation in DBpedia, BabelNet, and Wikidata yet supports successful cross-lingual transfer using other LOD sources would challenge the definitions.
Original abstract
Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high- and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization is leveraged to bring a formal definition of low-, high-, and medium-resource languages that could be later leveraged to select cross-lingual transfer candidates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a methodology to analyze the distribution of languages across Linked Open Data Knowledge Graphs (LOD KGs), specifically using DBpedia, BabelNet, and Wikidata. It proposes a preliminary multi-level categorization of languages based on this analysis and derives a formal definition of low-, medium-, and high-resource languages intended to support selection of candidates for cross-lingual transfer.
Significance. If the methodology and resulting definitions are sound, the work addresses a genuine gap by providing an empirical, data-driven starting point for quantifying language resources in the Semantic Web. This could aid future efforts in cross-lingual transfer and help mitigate digital divides. The use of multiple established KGs (DBpedia, BabelNet, Wikidata) for the analysis is a positive aspect that supports robustness and potential reproducibility.
major comments (1)
- [Methodology and Results sections] The manuscript describes the intent to present a methodology and formal definitions but provides no specific details on analysis methods, thresholds for categorization, or quantitative results (e.g., language counts or distribution statistics per KG). This makes it impossible to assess whether the proposed definitions are well-supported or reproducible.
minor comments (1)
- [Abstract] The abstract and introduction could include at least one concrete example of a language categorized as low-resource with supporting statistics from the KGs to illustrate the approach.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address the single major comment below and will revise the manuscript accordingly.
- Referee: [Methodology and Results sections] The manuscript describes the intent to present a methodology and formal definitions but provides no specific details on analysis methods, thresholds for categorization, or quantitative results (e.g., language counts or distribution statistics per KG). This makes it impossible to assess whether the proposed definitions are well-supported or reproducible.
  Authors: We agree that the current poster version lacks the quantitative details and explicit thresholds needed for full assessment and reproducibility. The poster format limited space for these elements. In the revised manuscript we will add: (1) a description of the exact analysis methods (e.g., how language distributions were extracted and normalized across the three KGs), (2) the concrete thresholds and criteria used for the multi-level categorization, and (3) summary statistics including language counts and distribution figures per KG. These additions will be placed in expanded Methodology and Results sections.
  Revision: yes
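The rebuttal mentions normalizing language distributions across the three KGs without saying how. One plausible reading, sketched below purely as an assumption, is to convert raw counts into within-KG shares so that graphs of very different sizes become comparable, then average the shares per language. The function names, placeholder KG names, language codes, and counts are all hypothetical.

```python
# Hypothetical sketch of one possible normalization; not the authors' stated method.
def normalize_shares(counts):
    """Convert raw counts {kg: {lang: count}} into within-KG shares that sum to 1."""
    shares = {}
    for kg, per_language in counts.items():
        total = sum(per_language.values())
        shares[kg] = {
            lang: (count / total if total else 0.0)
            for lang, count in per_language.items()
        }
    return shares


def mean_share(shares):
    """Average each language's share across KGs; a missing language counts as 0."""
    languages = set()
    for per_language in shares.values():
        languages.update(per_language)
    n_kgs = len(shares)
    return {
        lang: sum(per_kg.get(lang, 0.0) for per_kg in shares.values()) / n_kgs
        for lang in sorted(languages)
    }


# Made-up counts for two placeholder KGs and two placeholder language codes.
raw = {"kg_a": {"xx": 1, "yy": 3}, "kg_b": {"xx": 2, "yy": 2}}
print(mean_share(normalize_shares(raw)))
```

Shares remove the effect of overall KG size, but they also make each language's value depend on every other language in the same graph, which is worth stating explicitly if thresholds are later defined on them.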
Circularity Check
No significant circularity detected
The paper presents an empirical methodology to analyze language distributions across three external LOD KGs (DBpedia, BabelNet, Wikidata) and derives a preliminary multi-level categorization yielding a formal definition of low-, medium-, and high-resource languages. This definition is explicitly framed as a starting point for future use rather than a validated or derived result. No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the derivation chain; all inputs are independent external data sources. The central contribution therefore rests on external data and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language distribution in DBpedia, BabelNet, and Wikidata is a proxy for resource level in LOD KGs.