Which Are the Low-Resource Languages of the Semantic Web?

Fabien Gandon (WIMMICS); Miguel Couceiro (INESC-ID); Ndeye-Emilie Mbengue (WIMMICS); Pierre Monnin (WIMMICS)

arxiv: 2605.05929 · v1 · submitted 2026-05-07 · 💻 cs.AI

Pith reviewed 2026-05-08 10:58 UTC · model grok-4.3

keywords low-resource languages · Semantic Web · Linked Open Data · cross-lingual transfer · DBpedia · BabelNet · Wikidata · language distribution

The pith

A methodology using DBpedia, BabelNet and Wikidata defines low-, medium- and high-resource languages for Linked Open Data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to measure how well different languages are represented in major multilingual knowledge graphs on the Semantic Web. By examining language distributions in DBpedia, BabelNet, and Wikidata, the authors create categories that label languages as low-, medium-, or high-resource. This matters because clear definitions are needed to choose which languages can benefit from cross-lingual transfer techniques to reduce the digital divide. Without them, efforts to make open data accessible across languages stay informal and hard to scale.

Core claim

The authors present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization brings a formal definition of low-, high-, and medium-resource languages that could be leveraged to select cross-lingual transfer candidates.

What carries the argument

The multi-level categorization of languages derived from their presence and distribution statistics in DBpedia, BabelNet, and Wikidata, which serves as a quantitative proxy for resource levels in LOD KGs.
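A multi-level categorization of this kind could be sketched as follows. This is only an illustration: the counts, the `LOW_MAX` and `HIGH_MIN` cut-offs, and the majority-vote aggregation are all assumptions made for the example, since the poster does not disclose its actual thresholds or aggregation rule.

```python
import math

# Hypothetical per-KG counts of language-tagged entries (NOT figures from the paper).
counts = {
    "en": {"dbpedia": 5_000_000, "babelnet": 8_000_000, "wikidata": 90_000_000},
    "wo": {"dbpedia": 0, "babelnet": 12_000, "wikidata": 40_000},
    "br": {"dbpedia": 60_000, "babelnet": 300_000, "wikidata": 2_000_000},
}

# Assumed cut-offs on log10(count); the paper's real thresholds are not published.
LOW_MAX = 5.0    # fewer than 10^5 entries counts as low
HIGH_MIN = 6.5   # roughly 3.16 million entries or more counts as high

LEVELS = ["low", "medium", "high"]


def level_in_kg(count):
    """Assign a level to one language in one KG from its entry count."""
    if count <= 0:
        return "low"
    magnitude = math.log10(count)
    if magnitude < LOW_MAX:
        return "low"
    if magnitude >= HIGH_MIN:
        return "high"
    return "medium"


def categorize(per_kg):
    """Aggregate per-KG levels by majority vote; ties resolve to the lower level."""
    votes = [level_in_kg(c) for c in per_kg.values()]
    return max(LEVELS, key=lambda lv: (votes.count(lv), -LEVELS.index(lv)))


for lang, per_kg in counts.items():
    print(lang, categorize(per_kg))
```

Resolving ties toward the lower level is a deliberately conservative choice for the sketch: a language is only labeled well-resourced when most of the graphs agree.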

Load-bearing premise

That the language distribution statistics from DBpedia, BabelNet, and Wikidata serve as a valid measure of resource availability for cross-lingual transfer in the Semantic Web overall.

What would settle it

Finding a language that has very low representation in DBpedia, BabelNet, and Wikidata yet supports successful cross-lingual transfer using other LOD sources would challenge the definitions.

Figures

Figures reproduced from arXiv: 2605.05929 by Fabien Gandon (WIMMICS), Miguel Couceiro (INESC-ID), Ndeye-Emilie Mbengue (WIMMICS), Pierre Monnin (WIMMICS).

**Figure 1.** Language coverage (log-log) in BabelNet, Wikidata, DBpedia, and in …

**Figure 2.** Language coverage (log-log) in the aggregated LOD KGs. The x-axis …

Original abstract

Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high- and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization is leveraged to bring a formal definition of low-, high-, and medium-resource languages that could be later leveraged to select cross-lingual transfer candidates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This poster gives a first concrete categorization of low-resource languages in LOD by analyzing distributions in DBpedia, BabelNet, and Wikidata, but keeps everything preliminary with few specifics shown.

The main point of this poster is a methodology for checking language presence across three big LOD graphs and turning the results into a multi-level split with a formal definition of low-, medium-, and high-resource languages for the Semantic Web context. They position the definition as something that could later help pick languages for cross-lingual transfer work. That fills a gap they note, since no prior quantitative version existed in LOD specifically. Using real counts from DBpedia, BabelNet, and Wikidata grounds the idea instead of leaving it purely conceptual, which is a clear step forward for anyone trying to address the language divide in open data. The framing around digital inclusion and transfer candidates is straightforward and practical.

The soft spots are mostly about what is missing rather than what is wrong. The abstract and poster format give no thresholds, no actual distribution numbers, and no test of whether the categories predict anything useful about transfer. That leaves the definition as a starting sketch rather than a tested tool. Relying on just those three graphs as the proxy is reasonable for now but could be narrow if other LOD sources differ.

This is aimed at Semantic Web people working on multilingual KGs or resource-aware applications. A reader who needs a baseline way to label languages in this domain can take the categorization and build on it.

The paper shows clear thinking on a real subfield problem without internal contradictions or circular claims. It deserves peer review so the authors can add the analysis details and get feedback on the cutoffs. I would send it to referees rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The paper presents a methodology to analyze the distribution of languages across Linked Open Data Knowledge Graphs (LOD KGs), specifically using DBpedia, BabelNet, and Wikidata. It proposes a preliminary multi-level categorization of languages based on this analysis and derives a formal definition of low-, medium-, and high-resource languages intended to support selection of candidates for cross-lingual transfer.

Significance. If the methodology and resulting definitions are sound, the work addresses a genuine gap by providing an empirical, data-driven starting point for quantifying language resources in the Semantic Web. This could aid future efforts in cross-lingual transfer and help mitigate digital divides. The use of multiple established KGs (DBpedia, BabelNet, Wikidata) for the analysis is a positive aspect that supports robustness and potential reproducibility.

major comments (1)

[Methodology and Results sections] The manuscript describes the intent to present a methodology and formal definitions but provides no specific details on analysis methods, thresholds for categorization, or quantitative results (e.g., language counts or distribution statistics per KG). This makes it impossible to assess whether the proposed definitions are well-supported or reproducible.

minor comments (1)

[Abstract] The abstract and introduction could include at least one concrete example of a language categorized as low-resource with supporting statistics from the KGs to illustrate the approach.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Methodology and Results sections] The manuscript describes the intent to present a methodology and formal definitions but provides no specific details on analysis methods, thresholds for categorization, or quantitative results (e.g., language counts or distribution statistics per KG). This makes it impossible to assess whether the proposed definitions are well-supported or reproducible.

Authors: We agree that the current poster version lacks the quantitative details and explicit thresholds needed for full assessment and reproducibility. The poster format limited space for these elements. In the revised manuscript we will add: (1) a description of the exact analysis methods (e.g., how language distributions were extracted and normalized across the three KGs), (2) the concrete thresholds and criteria used for the multi-level categorization, and (3) summary statistics including language counts and distribution figures per KG. These additions will be placed in expanded Methodology and Results sections.

Revision: yes
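For readers wondering what an extraction and normalization step of the kind mentioned in the rebuttal might involve, here is a minimal sketch. It assumes the KG is available as N-Triples text and that coverage is measured by counting language-tagged literals; both are assumptions for illustration, not the authors' stated procedure. The `example.org` IRIs and the helper names `count_languages` and `shares` are hypothetical.

```python
import re
from collections import Counter

# Matches a language-tagged literal ending an N-Triples statement, e.g. "Dakar"@wo .
LANG_TAG = re.compile(r'"(?:[^"\\]|\\.)*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)\s*\.\s*$')


def count_languages(ntriples_lines):
    """Count language-tagged literals per primary language subtag."""
    tally = Counter()
    for line in ntriples_lines:
        match = LANG_TAG.search(line)
        if match:
            # Assumption: fold regional variants ("en-GB") into the primary subtag ("en").
            tally[match.group(1).split("-")[0].lower()] += 1
    return tally


def shares(tally):
    """Normalize raw counts to each language's share of all tagged literals."""
    total = sum(tally.values())
    return {lang: n / total for lang, n in tally.items()} if total else {}


# Tiny hand-written sample (hypothetical data, not taken from any of the three KGs).
sample = [
    '<http://example.org/Q1> <http://www.w3.org/2000/01/rdf-schema#label> "Dakar"@wo .',
    '<http://example.org/Q1> <http://www.w3.org/2000/01/rdf-schema#label> "Dakar"@en .',
    '<http://example.org/Q2> <http://www.w3.org/2000/01/rdf-schema#label> "Paris"@en-GB .',
    '<http://example.org/Q2> <http://example.org/population> "2100000"^^<http://www.w3.org/2001/XMLSchema#integer> .',
]

tally = count_languages(sample)
print(tally)
print(shares(tally))
```

Normalizing to shares makes counts comparable across graphs of very different sizes, though whether the authors normalize this way is not stated in the poster.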

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical methodology to analyze language distributions across three external LOD KGs (DBpedia, BabelNet, Wikidata) and derives a preliminary multi-level categorization yielding a formal definition of low-/medium-/high-resource languages. This definition is explicitly framed as a starting point for future use rather than a validated or derived result. No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the derivation chain; all inputs are independent external data sources. The central contribution is self-contained against external benchmarks with no reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on domain assumption that KG coverage proxies resource level; no free parameters or invented entities.

axioms (1)

Domain assumption: Language distribution in DBpedia, BabelNet, and Wikidata proxies resource level for LOD KGs. This is the basis for the proposed categorization and definition.

pith-pipeline@v0.9.0 · 9450 in / 930 out tokens · 87460 ms · 2026-05-08T10:58:23.659833+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Helm, P., Bella, G., Koch, G., Giunchiglia, F.: Diversity and language technology: how language modeling bias causes epistemic injustice. Ethics Inf. Technol. 26(1), 8 (2024). https://doi.org/10.1007/S10676-023-09742-6

[2] Joshi, P., Santy, S., Budhiraja, A., Bali, K., Choudhury, M.: The state and fate of linguistic diversity and inclusion in the NLP world. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. pp. 6282–6293. Association for Computational Linguistics (2020). https://doi.org/10.18653/...
doi:10.18653/v1/2020.acl-main.560

[3] Nigatu, H.H., Tonja, A.L., Rosman, B., Solorio, T., Choudhury, M.: The Zeno's paradox of 'low-resource' languages. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024. pp. 17753–17774. Association for Computational Linguistics (2024). https://doi.org/10.18653/V1/202...
doi:10.18653/v1/2024.emnlp-main.983

[4] Vīksna, R., Skadiņa, I., Skadiņš, R., Vasiļjevs, A., Rozis, R.: Assessing multilinguality of publicly accessible websites. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. pp. 2108–2116. European Language Resources Association (Jun 2022)