
arxiv: 2605.05929 · v1 · submitted 2026-05-07 · 💻 cs.AI

Which Are the Low-Resource Languages of the Semantic Web?

Pith reviewed 2026-05-08 10:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords low-resource languages · Semantic Web · Linked Open Data · cross-lingual transfer · DBpedia · BabelNet · Wikidata · language distribution

The pith

A methodology using DBpedia, BabelNet and Wikidata defines low-, medium- and high-resource languages for Linked Open Data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to measure how well different languages are represented in major multilingual knowledge graphs on the Semantic Web. By examining language distributions in DBpedia, BabelNet, and Wikidata, the authors create categories that label languages as low-, medium-, or high-resource. This matters because clear definitions are needed to choose which languages can benefit from cross-lingual transfer techniques to reduce the digital divide. Without them, efforts to make open data accessible across languages stay informal and hard to scale.

Core claim

The authors present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization brings a formal definition of low-, high-, and medium-resource languages that could be leveraged to select cross-lingual transfer candidates.

What carries the argument

The multi-level categorization of languages derived from their presence and distribution statistics in DBpedia, BabelNet, and Wikidata, which serves as a quantitative proxy for resource levels in LOD KGs.
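As a rough illustration of what such a categorization could look like, the sketch below bins languages into three tiers from per-KG counts. Everything specific in it is an assumption rather than the authors' method: the counts are made-up toy numbers, the helper names `aggregate` and `categorize` are hypothetical, summation is only one possible way to combine the three KGs, and the two thresholds are placeholders, since no thresholds are stated here.

```python
from collections import Counter

# Toy, made-up counts of labelled entities per language in each KG (NOT real data).
toy_counts = {
    "DBpedia": {"en": 1_000_000, "fr": 20_000, "wo": 50},
    "BabelNet": {"en": 2_000_000, "fr": 50_000, "wo": 900},
    "Wikidata": {"en": 5_000_000, "fr": 90_000, "wo": 2_000, "ff": 30},
}

def aggregate(counts_by_kg):
    """Sum each language's counts over all KGs (one possible aggregation)."""
    total = Counter()
    for per_language in counts_by_kg.values():
        total.update(per_language)
    return dict(total)

def categorize(total_by_language, low_max=10_000, high_min=1_000_000):
    """Assign a tier per language using two hypothetical thresholds."""
    tiers = {}
    for language, count in total_by_language.items():
        if count < low_max:
            tiers[language] = "low"
        elif count >= high_min:
            tiers[language] = "high"
        else:
            tiers[language] = "medium"
    return tiers

totals = aggregate(toy_counts)
tiers = categorize(totals)
```

With these toy numbers, `tiers` maps `en` to "high", `fr` to "medium", and both `wo` and `ff` to "low". Moving either threshold changes the labels, which is why the choice of cut-offs matters for any definition built this way.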

Load-bearing premise

That the language distribution statistics from DBpedia, BabelNet, and Wikidata serve as a valid measure of resource availability for cross-lingual transfer in the Semantic Web overall.

What would settle it

Finding a language that has very low representation in DBpedia, BabelNet, and Wikidata yet supports successful cross-lingual transfer using other LOD sources would challenge the definitions.

Figures

Figures reproduced from arXiv: 2605.05929 by Fabien Gandon (WIMMICS), Miguel Couceiro (INESC-ID), Ndeye-Emilie Mbengue (WIMMICS), Pierre Monnin (WIMMICS).

Figure 1: Language coverage (log-log) in BabelNet, Wikidata, DBpedia, and in ...
Figure 2: Language coverage (log-log) in the aggregated LOD KGs. The x-axis ...
Original abstract

Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high- and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization is leveraged to bring a formal definition of low-, high-, and medium-resource languages that could be later leveraged to select cross-lingual transfer candidates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a methodology to analyze the distribution of languages across Linked Open Data Knowledge Graphs (LOD KGs), specifically using DBpedia, BabelNet, and Wikidata. It proposes a preliminary multi-level categorization of languages based on this analysis and derives a formal definition of low-, medium-, and high-resource languages intended to support selection of candidates for cross-lingual transfer.

Significance. If the methodology and resulting definitions are sound, the work addresses a genuine gap by providing an empirical, data-driven starting point for quantifying language resources in the Semantic Web. This could aid future efforts in cross-lingual transfer and help mitigate digital divides. The use of multiple established KGs (DBpedia, BabelNet, Wikidata) for the analysis is a positive aspect that supports robustness and potential reproducibility.

major comments (1)
  1. [Methodology and Results sections] The manuscript describes the intent to present a methodology and formal definitions but provides no specific details on analysis methods, thresholds for categorization, or quantitative results (e.g., language counts or distribution statistics per KG). This makes it impossible to assess whether the proposed definitions are well-supported or reproducible.
minor comments (1)
  1. [Abstract] The abstract and introduction could include at least one concrete example of a language categorized as low-resource with supporting statistics from the KGs to illustrate the approach.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address the single major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methodology and Results sections] The manuscript describes the intent to present a methodology and formal definitions but provides no specific details on analysis methods, thresholds for categorization, or quantitative results (e.g., language counts or distribution statistics per KG). This makes it impossible to assess whether the proposed definitions are well-supported or reproducible.

    Authors: We agree that the current poster version lacks the quantitative details and explicit thresholds needed for full assessment and reproducibility. The poster format limited space for these elements. In the revised manuscript we will add: (1) a description of the exact analysis methods (e.g., how language distributions were extracted and normalized across the three KGs), (2) the concrete thresholds and criteria used for the multi-level categorization, and (3) summary statistics including language counts and distribution figures per KG. These additions will be placed in expanded Methodology and Results sections.

    revision: yes
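To make item (1) of the response concrete, here is a minimal sketch of one way language distributions could be normalized across KGs of very different sizes. It is an assumption for illustration only: the function names `to_shares` and `mean_share`, the KG labels `KG_A` and `KG_B`, the toy counts, and the choice to average per-KG shares are all hypothetical and are not described in the paper's summary.

```python
def to_shares(per_language):
    """Turn raw counts for one KG into fractions that sum to 1."""
    total = sum(per_language.values())
    return {lang: n / total for lang, n in per_language.items()}

def mean_share(counts_by_kg):
    """Average each language's share over all KGs; absent languages count as 0."""
    shares = [to_shares(c) for c in counts_by_kg.values()]
    languages = set().union(*shares)
    return {
        lang: sum(s.get(lang, 0.0) for s in shares) / len(shares)
        for lang in languages
    }

# Toy, made-up counts (NOT real data).
toy = {
    "KG_A": {"en": 80, "fr": 20},
    "KG_B": {"en": 60, "fr": 30, "wo": 10},
}
result = mean_share(toy)
```

Averaging shares instead of summing raw counts keeps the largest KG from dominating the result, at the cost of giving small KGs the same weight as large ones. Which trade-off is appropriate is exactly the kind of detail the referee asks the authors to spell out.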

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents an empirical methodology to analyze language distributions across three external LOD KGs (DBpedia, BabelNet, Wikidata) and derives a preliminary multi-level categorization yielding a formal definition of low-/medium-/high-resource languages. This definition is explicitly framed as a starting point for future use rather than a validated or derived result. No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the derivation chain; all inputs are independent external data sources. The central contribution is self-contained against external benchmarks with no reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that KG coverage is a proxy for resource level; there are no free parameters or invented entities.

axioms (1)
  • domain assumption: Language distribution in DBpedia, BabelNet, and Wikidata is a proxy for resource level in LOD KGs.
    Basis for the proposed categorization and definition.

pith-pipeline@v0.9.0 · 9450 in / 930 out tokens · 87460 ms · 2026-05-08T10:58:23.659833+00:00 · methodology

