An audit of over twenty African NLP corpus families documents license incompatibilities, hidden restrictions, and data persistence failures via a six-tier matrix applied to three languages.
Open but Incompatible: A License Compatibility Analysis of Corpora for Low-Resource African Languages
abstract
Creative Commons licenses dominate African NLP corpus releases, but their compatibility rules are rarely applied. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs clause silently prohibits tokenisation and annotation. This paper audits the license provenance of over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed a Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are now dead). A pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities close the paper.
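The two rules the abstract names (ShareAlike versus NonCommercial, and NoDerivs) can be illustrated with a minimal sketch. This is not the paper's six-tier matrix: the function names `clauses` and `combination_problems` are hypothetical, license versions are ignored, and only the four standard Creative Commons clause codes (BY, SA, NC, ND) are modelled.

```python
import re
from itertools import combinations

CLAUSE_CODES = {"BY", "SA", "NC", "ND"}

def clauses(license_id: str) -> frozenset:
    """Extract CC clause codes from an identifier such as 'CC-BY-NC-SA 4.0'."""
    tokens = re.split(r"[^A-Z]+", license_id.upper())
    return frozenset(t for t in tokens if t in CLAUSE_CODES)

def combination_problems(licenses):
    """Return reasons why the given sources cannot be merged into one
    published, adapted dataset (simplified rules, assumptions as stated above)."""
    problems = []
    # Rule 1: NoDerivs material may not be distributed in adapted form.
    for lic in licenses:
        if "ND" in clauses(lic):
            problems.append(f"{lic}: NoDerivs forbids publishing adaptations")
    # Rule 2: ShareAlike without NonCommercial cannot be mixed with any
    # NonCommercial material, because no single output license satisfies both.
    for a, b in combinations(licenses, 2):
        ca, cb = clauses(a), clauses(b)
        if "ND" in ca or "ND" in cb:
            continue  # already reported by rule 1
        a_blocks = "SA" in ca and "NC" not in ca and "NC" in cb
        b_blocks = "SA" in cb and "NC" not in cb and "NC" in ca
        if a_blocks or b_blocks:
            problems.append(f"{a} + {b}: ShareAlike conflicts with NonCommercial")
    return problems

if __name__ == "__main__":
    for issue in combination_problems(["CC-BY-SA", "CC-BY-NC", "CC-BY-ND"]):
        print(issue)
```

An empty list means only that these two simplified rules raised no objection; it does not establish that a corpus is legally clean, which is the purpose of the due diligence checklist the abstract mentions.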
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)